Article List
    Going through the code block below, we can draw some conclusions: val barr1 = sc.broadcast(arr1) //- broadcast an array with 1M int elements //- this is an embedded broadcast wrapped by the rdd below, so this data val observedSizes = sc.parallelize(1 to 10, slices).map(_ => barr1 ...
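The broadcast idea in the excerpt can be sketched in plain Scala without a cluster (the names arr1/barr1 follow the excerpt; the single shared array here is a stand-in for what sc.broadcast(arr1).value gives each executor):

```scala
// Plain-Scala sketch of what sc.broadcast buys you: one shared read-only copy
// per JVM instead of one copy serialized into every task's closure.
val arr1 = Array.fill(1 << 20)(1)                     // ~1M ints, like arr1 in the excerpt
val barr1 = arr1                                      // stand-in for sc.broadcast(arr1).value
val observedSizes = (1 to 10).map(_ => barr1.length)  // every "task" reads the same copy
```

With a real SparkContext, each of the 10 tasks would likewise observe the same length, because all tasks on an executor share the one broadcast copy.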
   access pattern in spark storage         [1] So far we have seen how Spark uses the JVM's memory and what execution slots on a cluster are; we have not yet covered the details of tasks, which will be discussed in another article. A task is basically Spark's unit of work, ...

spark-hive on spark

Overall design: The general idea behind Hive on Spark is to reuse Hive's logic-layer functionality as much as possible; starting from physical plan generation, it provides a complete set of Spark-specific implementations, such as SparkCompiler, SparkTask, and so on, so that Hive queries can be executed as Spark jobs. The main design principles are as follows. Minimize changes to Hive's existing code. This is the biggest difference from the earlier Shark design. Shark changed Hive so heavily that it could not be accepted by the Hive community; Hive on Spark changes Hive's code as little as possible, so as not to affect Hive's current support for MapReduce and Tez. At the same time, Hive on Spark guarantees that existing MapReduc ...
  In summary, the choice between RDD, DataFrame, and Dataset seems obvious. While the former offers low-level functionality and control, the latter allow a custom view and structure, offer high-level and domain-specific operations, save space, and execute at superior speeds. ...
     Yep, from [1] we know that Spark divides a job into two steps to be executed: a. launch executors, and b. have the driver assign tasks to those executors. So how executors are assigned to workers by the master is very important! For standalone mode, when we dive into the source in Master#receiveWithLo ...
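A hypothetical plain-Scala sketch of the round-robin "spread out" policy a standalone Master can use when handing executor cores to workers (the Worker case class, worker ids, and spreadOut are made-up names for illustration, not Spark's actual internals):

```scala
// Allocate one core at a time, round-robin over workers that still have
// free cores, until the request is satisfied or capacity runs out.
case class Worker(id: String, freeCores: Int)

def spreadOut(workers: Seq[Worker], coresWanted: Int): Map[String, Int] = {
  val assigned = scala.collection.mutable.Map(workers.map(w => w.id -> 0): _*)
  val free     = scala.collection.mutable.Map(workers.map(w => w.id -> w.freeCores): _*)
  val ids      = workers.map(_.id)
  var left     = coresWanted
  var pos      = 0
  while (left > 0 && free.values.sum > 0) {
    val id = ids(pos % ids.size)
    if (free(id) > 0) { free(id) -= 1; assigned(id) += 1; left -= 1 }
    pos += 1
  }
  assigned.toMap
}
```

Spreading executors across workers like this favors data locality and balanced load; the alternative policy is to pack one worker full before moving to the next.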
  The code below is all from Spark's examples, aside from some comments added by me.   val lines = ctx.textFile(args(0), 1) //-1 generate links of <src,targets> pairs var links = lines.map{ s => val parts = s.split("\\s+") (parts(0), parts(1)) //-pair of ...
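The iteration that example performs can be mirrored in plain Scala on an in-memory Map, using a two-page toy graph (the RDD version does the same work with join/flatMap/reduceByKey; the 0.15/0.85 damping constants match Spark's PageRank example):

```scala
// links: src -> targets, as built from the <src,targets> pairs above.
val links: Map[String, Seq[String]] = Map("a" -> Seq("b"), "b" -> Seq("a"))

// Every page starts with rank 1.0.
var ranks: Map[String, Double] = links.map { case (k, _) => k -> 1.0 }

for (_ <- 1 to 10) {
  // Each page sends rank/outDegree to every page it links to ...
  val contribs = links.toSeq.flatMap { case (src, targets) =>
    targets.map(t => t -> ranks(src) / targets.size)
  }
  // ... and new ranks are the damped sum of received contributions.
  ranks = contribs.groupBy(_._1).map { case (k, vs) =>
    k -> (0.15 + 0.85 * vs.map(_._2).sum)
  }
}
```

On this symmetric two-page cycle the ranks sit at the fixed point 1.0 for both pages, which is a handy sanity check for the loop.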
   As in other big data technologies, checkpointing is a well-known way to keep a snapshot of data to speed up failover, i.e. restore to the most recent checkpointed state of the data, so you will not need to recompute the RDD for the job.   In fact, the checkpoint op will cut down the relationships ...
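The lineage cut can be shown with a toy sketch: after checkpointing, a node forgets its parents, so recomputation restarts from the saved snapshot rather than replaying the whole chain (Node and depth are made-up names for illustration, not Spark's classes):

```scala
// A node remembers its parents; lineage depth is the length of that chain.
final case class Node(name: String, parents: List[Node]) {
  def depth: Int = 1 + (if (parents.isEmpty) 0 else parents.map(_.depth).max)
  def checkpoint: Node = copy(parents = Nil)   // the lineage is cut here
}

val a = Node("a", Nil)
val b = Node("b", List(a))
val c = Node("c", List(b))
```

Before checkpointing, c's lineage runs back through b to a; if b is checkpointed, a rebuilt c only depends on b's saved state.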
  There are several nice techniques in Spark, e.g. on the user API side. Here we will dive in and check how Spark implements them.   1. abstract (functions in RDD) group function feature principle     1 first()  retrieves the first element in this rdd; if there is more than one partition, the ...
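The first() contract described in that row can be sketched in plain Scala (partitions modelled as nested sequences; this is a stand-in for RDD.first(), not Spark's implementation):

```scala
// Scan partitions in order and return the first element found,
// skipping any empty partitions along the way.
def first[T](partitions: Seq[Seq[T]]): T =
  partitions.iterator.flatMap(_.iterator).next()
```

The lazy iterator chain means only as many partitions are touched as needed to find one element, which mirrors why RDD.first() avoids a full scan.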
  There are several component entities that run as daemons in Spark (standalone); knowing what they do and how they work is indeed necessary.     akka msg flow similar to tcp     note: register driver = RequestSubmitDriver; register app = RegisterApplication, which is sent by AppClient to the master on startup ...
  As the official statement says, Spark is a computation framework, i.e. you can use it anywhere on top of a platform (e.g. YARN, Mesos) that can run it.   So under such a cluster manager, none of Spark's own daemons are necessary to run the app; feel free to stop all of them.        hadoop@xx:~/spark/spark-1.4.1- ...
  Similar to the previous article, this one is focused on cluster mode. 1.issue command ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master spark://gzsw-02:6066 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt    note:1) th ...
1.startup command ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode client --master spark://gzsw-02:7077 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user/hadoop/input.txt    note:1) the master is the cluster manager, as stated on the spark master ui page, ...
  Yep, just as you would guess, there are many deploy modes in Spark, e.g. standalone, yarn, mesos etc. Going a step further, standalone mode can be divided into standalone and cluster(local) mode; the former is a real cluster mode in which the master and workers all run on individual nodes, while the latte ...
1.overview in wordcount -memory tips: Job > Stage > Rdd > Dependency. RDDs are linked by Dependencies.   2.terms -RDD is associated via Dependency, i.e. Dependency is a wrapper of RDD.   A Stage contains its corresponding rdd; a dependency contains the parent rdd also. -Stage is a wrapper of same ...
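The containment chain above (Job > Stage > RDD > Dependency) can be modelled as a toy sketch (all class names here are simplified stand-ins for Spark's internals, not its real types):

```scala
// A Dependency wraps the parent RDD; an RDD carries its dependencies;
// a Stage wraps an RDD; a Job holds its stages.
final case class Dependency(parent: Rdd)
final case class Rdd(name: String, deps: List[Dependency])
final case class Stage(rdd: Rdd)
final case class Job(stages: List[Stage])

val base   = Rdd("textFile", Nil)
val mapped = Rdd("map", List(Dependency(base)))
val job    = Job(List(Stage(mapped)))
```

Walking job -> stage -> rdd -> dependency -> parent rdd reproduces exactly the linkage the tips describe.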
1  data flow overview  note: -the arrows here mean: a bold line is a data line, without sender/receiver semantics, only data 'from-to' -two ways to retrieve a task result: direct result and indirect result (when over the akka frame size)   2.actor in spark 3.several components communicate through E ...
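The two result paths can be sketched as follows (the threshold check is the idea; the class names, blockId, and packResult are illustrative, not Spark's actual API):

```scala
// A result smaller than the frame size is sent back directly; a larger one
// is stored in the block manager and only a reference is sent.
sealed trait TaskResult
case class DirectResult(bytes: Array[Byte]) extends TaskResult
case class IndirectResult(blockId: String, size: Int) extends TaskResult

def packResult(bytes: Array[Byte], frameSize: Int): TaskResult =
  if (bytes.length <= frameSize) DirectResult(bytes)
  else IndirectResult(blockId = "taskresult_0", size = bytes.length)
```

The driver then fetches an IndirectResult's payload from storage by its block id, which is why oversized results take the slower round trip.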