After a heavy time cost (mostly spent downloading a huge number of jars), the first example from the book 'Learning Spark' finally ran through.
The source code is very simple:
/**
 * Illustrates flatMap + countByValue for wordcount.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    println("**spark-home :" + System.getenv("SPARK_HOME")) // null if not set
    val sc = new SparkContext(master, "WordCount", System.getenv("SPARK_HOME"))
    val input = args.length match {
      case x: Int if x > 1 => sc.textFile(args(1))
      case _ => sc.parallelize(List("pandas", "i like pandas")) // else generate a list as input data
    }
    val words = input.flatMap(line => line.split(" "))
    args.length match {
      case x: Int if x > 2 => {
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y } // same as reduceByKey((x, y) => x + y)
        counts.saveAsTextFile(args(2))
      }
      case _ => { // else count occurrences per word
        val wc = words.countByValue()
        println(wc.mkString(","))
      }
    }
  }
}
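To see what flatMap followed by countByValue actually computes, here is a minimal sketch of the same logic on plain Scala collections, with no cluster needed (my own illustration, not from the book; countByValue on an RDD is equivalent to grouping and sizing on a local Seq):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val input = List("pandas", "i like pandas")
    // flatMap: split each line into words and flatten into one sequence
    val words = input.flatMap(line => line.split(" "))
    // local equivalent of RDD.countByValue: word -> number of occurrences
    val wc: Map[String, Int] = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    println(wc.mkString(","))
  }
}
```

For the book's sample input, this yields pandas -> 2, i -> 1 and like -> 1, matching the job output in the logs below.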
The project layout looks like this:
Below are the logs of a run in local mode:
JHLinMacBook:learning-spark-src userxx$ spark-submit --verbose --class com.oreilly.learningsparkexamples.scala.WordCount target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
Using properties file: null
Parsed arguments:
  master                  local[*]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.oreilly.learningsparkexamples.scala.WordCount
  primaryResource         file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
  name                    com.oreilly.learningsparkexamples.scala.WordCount
  childArgs               []
  jars                    null
  packages                null
  repositories            null
  verbose                 true
Spark properties used, including those specified through --conf and those from the properties file null:
Main class:
com.oreilly.learningsparkexamples.scala.WordCount
Arguments:
System properties:
SPARK_SUBMIT -> true
spark.app.name -> com.oreilly.learningsparkexamples.scala.WordCount
spark.jars -> file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
spark.master -> local[*]
Classpath elements:
file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
/Users/userxx/Cloud/Spark/spark-1.4.1-bin-hadoop2.4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/22 23:20:13 INFO SparkContext: Running Spark version 1.4.1
2015-09-22 23:20:13.411 java[918:1903] Unable to load realm info from SCDynamicStore
15/09/22 23:20:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 23:20:13 WARN Utils: Your hostname, JHLinMacBook resolves to a loopback address: 127.0.0.1; using 192.168.1.144 instead (on interface en0)
15/09/22 23:20:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/22 23:20:13 INFO SecurityManager: Changing view acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: Changing modify acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userxx); users with modify permissions: Set(userxx)
15/09/22 23:20:14 INFO Slf4jLogger: Slf4jLogger started
15/09/22 23:20:14 INFO Remoting: Starting remoting
15/09/22 23:20:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.144:49613]
15/09/22 23:20:14 INFO Utils: Successfully started service 'sparkDriver' on port 49613.
15/09/22 23:20:14 INFO SparkEnv: Registering MapOutputTracker
15/09/22 23:20:14 INFO SparkEnv: Registering BlockManagerMaster
15/09/22 23:20:14 INFO DiskBlockManager: Created local directory at /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6
15/09/22 23:20:14 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/09/22 23:20:15 INFO HttpFileServer: HTTP File server directory is /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/httpd-38c60499-b2cf-4fe1-b94f-f3d04fcae1b3
15/09/22 23:20:15 INFO HttpServer: Starting HTTP Server
15/09/22 23:20:15 INFO Utils: Successfully started service 'HTTP file server' on port 49614.
15/09/22 23:20:15 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/22 23:20:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/22 23:20:15 INFO SparkUI: Started SparkUI at http://192.168.1.144:4040
15/09/22 23:20:15 INFO SparkContext: Added JAR file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar at http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:15 INFO Executor: Starting executor ID driver on host localhost
15/09/22 23:20:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49616.
15/09/22 23:20:15 INFO NettyBlockTransferService: Server created on 49616
15/09/22 23:20:15 INFO BlockManagerMaster: Trying to register BlockManager
15/09/22 23:20:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49616 with 265.4 MB RAM, BlockManagerId(driver, localhost, 49616)
15/09/22 23:20:15 INFO BlockManagerMaster: Registered BlockManager
15/09/22 23:20:16 INFO SparkContext: Starting job: countByValue at WordCount.scala:28
15/09/22 23:20:16 INFO DAGScheduler: Registering RDD 3 (countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Got job 0 (countByValue at WordCount.scala:28) with 1 output partitions (allowLocal=false)
15/09/22 23:20:16 INFO DAGScheduler: Final stage: ResultStage 1(countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28), which has no missing parents
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(3072) called with curMem=0, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KB, free 265.4 MB)
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(1755) called with curMem=3072, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1755.0 B, free 265.4 MB)
15/09/22 23:20:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49616 (size: 1755.0 B, free: 265.4 MB)
15/09/22 23:20:16 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:16 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/09/22 23:20:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1465 bytes)
15/09/22 23:20:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/22 23:20:16 INFO Executor: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:17 INFO Utils: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar to /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/fetchFileTemp4094197159668194166.tmp
15/09/22 23:20:17 INFO Executor: Adding file:/private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/learning-spark-examples_2.10-0.0.1.jar to class loader
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 879 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 300 ms on localhost (1/1)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/09/22 23:20:17 INFO DAGScheduler: ShuffleMapStage 0 (countByValue at WordCount.scala:28) finished in 0.332 s
15/09/22 23:20:17 INFO DAGScheduler: looking for newly runnable stages
15/09/22 23:20:17 INFO DAGScheduler: running: Set()
15/09/22 23:20:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/09/22 23:20:17 INFO DAGScheduler: failed: Set()
15/09/22 23:20:17 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/09/22 23:20:17 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28), which is now runnable
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(2304) called with curMem=4827, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 265.4 MB)
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(1371) called with curMem=7131, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1371.0 B, free 265.4 MB)
15/09/22 23:20:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49616 (size: 1371.0 B, free: 265.4 MB)
15/09/22 23:20:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/09/22 23:20:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1245 bytes)
15/09/22 23:20:17 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1076 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 84 ms on localhost (1/1)
15/09/22 23:20:17 INFO DAGScheduler: ResultStage 1 (countByValue at WordCount.scala:28) finished in 0.084 s
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/09/22 23:20:17 INFO DAGScheduler: Job 0 finished: countByValue at WordCount.scala:28, took 0.702187 s
pandas -> 2,i -> 1,like -> 1
15/09/22 23:20:17 INFO SparkContext: Invoking stop() from shutdown hook
15/09/22 23:20:17 INFO SparkUI: Stopped Spark web UI at http://192.168.1.144:4040
15/09/22 23:20:17 INFO DAGScheduler: Stopping DAGScheduler
15/09/22 23:20:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/22 23:20:17 INFO Utils: path = /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6, already present as root for deletion.
15/09/22 23:20:17 INFO MemoryStore: MemoryStore cleared
15/09/22 23:20:17 INFO BlockManager: BlockManager stopped
15/09/22 23:20:17 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/22 23:20:17 INFO SparkContext: Successfully stopped SparkContext
15/09/22 23:20:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/22 23:20:17 INFO Utils: Shutdown hook called
15/09/22 23:20:17 INFO Utils: Deleting directory /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2
As you can see, I submitted the application with the --verbose flag, so these details show up more clearly.
Tips:
One thing you should know is that spark-submit's parameter ordering is strict (and a bit ugly):
spark-submit [options] <app jar | python file> [app arguments]
So the application jar must come after the options (if any), and the application arguments come last.
For example, to submit a wordcount app to a standalone Spark cluster in client deploy mode:
spark-submit --master spark://gzsw-02:7077 --class org.apache.spark.examples.JavaWordCount --executor-memory 600m --total-executor-cores 16 --verbose --deploy-mode client /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-examples-1.4.1-hadoop2.4.0.jar /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/RELEASE 2 output-result
Here the second application argument (2) is the minimum number of partitions, and output-result is the directory where the detailed word-count results are written.
Supplement
Yes, Spark really does a kind of 'memory computing'. Of course, we have always used memory for intermediate results; the difference is that before Spark, this practice had not yet been formalized into a theory and a framework.
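Spark's in-memory computing shows up most directly in RDD caching: mark an RDD with cache() and the first action materializes its partitions in executor memory, so later actions reuse them instead of recomputing the lineage. A minimal sketch along the lines of the WordCount example above (my own illustration, assuming a local master):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("CacheDemo"))
    val words = sc.parallelize(List("pandas", "i like pandas"))
      .flatMap(line => line.split(" "))
      .cache() // first action materializes this RDD in memory
    println(words.count())        // computes the lineage and populates the cache
    println(words.countByValue()) // served from the cached partitions
    sc.stop()
  }
}
```

Without cache(), each of the two actions would re-read and re-split the input; with it, the split words are computed once.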