Packaging and Deploying Spark Projects

### Common SBT Commands

actions – show the commands available for the current project
update – download the dependencies
compile – compile the source code
test – run the tests
package – build a publishable JAR
publish-local – install the built JAR into the local Ivy cache
publish – publish the JAR to a remote repository (if one is configured)
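
These commands operate on the project's build definition. For reference, a minimal build.sbt sketch for a Spark project might look like the following (the project name and versions are illustrative placeholders, not taken from this document):

    // Minimal build.sbt sketch; names and versions are illustrative
    name := "my-spark-app"

    version := "0.1.0"

    scalaVersion := "2.10.4"

    // Spark is marked "provided" because the cluster supplies it at runtime
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"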

### sbt-assembly

Two versions of the sbt-assembly plugin are commonly used for building fat JARs: 0.11.2 and 0.13.0.

For version 0.11.2

Add an assembly.sbt file under the project/ directory with the following line:

$ cat project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Then add a build.sbt in the project root. Because Spark already bundles the Scala library, it does not need to be included again.

    import AssemblyKeys._

    name := "Simple Project"

    version := "1.0"

    organization := "com.databricks"

    scalaVersion := "2.10.4"

    // Dependencies: Spark and Hadoop are "provided" by the cluster;
    // third-party libraries are bundled into the assembly JAR
    libraryDependencies ++= Seq(
            "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided",
            "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
            // Third-party libraries
            "net.sf.jopt-simple" % "jopt-simple" % "4.3",
            "joda-time" % "joda-time" % "2.0"
    )

    // This statement includes the assembly plug-in capabilities
    assemblySettings

    // Configure the JAR produced by the assembly plug-in
    jarName in assembly := "my-project-assembly.jar"

    // A special option to exclude Scala itself from our assembly JAR, since Spark
    // already bundles Scala.
    assemblyOption in assembly :=
        (assemblyOption in assembly).value.copy(includeScala = false)
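
One optional tweak: older sbt-assembly releases run the test task before assembling, which can be disabled with a one-line setting. A sketch that works with the AssemblyKeys import above (not part of the original build file):

    // Optional: skip running tests when building the assembly JAR
    test in assembly := {}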

For version 0.13.0

Add an assembly.sbt file under the project/ directory with the following line:

$ cat project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Then add a build.sbt in the project root. Because Spark already bundles the Scala library, it does not need to be included again.

name := "hello"
version := "1.0"
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-client" % "2.3.0" % "provided",
    "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
)

// Configure the JAR produced by the assembly plug-in
assemblyJarName in assembly := "wordcount-assembly.jar"
// A special option to exclude Scala itself from our assembly JAR, since Spark
// already bundles Scala.
assemblyOption in assembly :=
(assemblyOption in assembly).value.copy(includeScala = false)
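
If two dependencies contain files at the same path (typically under META-INF), the assembly task fails with a "deduplicate" error. A sketch of a merge strategy using the sbt-assembly 0.12+ key names (the rules below are illustrative, not from the original build file):

    // Resolve duplicate files when merging dependency JARs (illustrative rules)
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }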

Then run sbt assembly to build the JAR:

> assembly
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[info] Assembly up to date: /usr/home/service/projects/hello/target/scala-2.10/wordcount-assembly.jar
[success] Total time: 0 s, completed 2015-4-2 11:27:43

### Using the Eclipse sbt Plugin

Add the global plugin configuration under the sbt user plugins directory:

// ~/.sbt/0.13/plugins (global plugin directory)
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "3.0.0")

Then run sbt eclipse to generate the Eclipse project files:

$ sbt eclipse
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] app

### Exporting Dependencies

Setting the following in build.sbt copies all managed dependencies into the project's lib_managed/ directory:

retrieveManaged := true

### Packaging a Java Application with SBT

When using sbt to manage a Java project, configure build.sbt as follows:

// Project name (artifact name in Maven)
name := "$Project_name"

// organization name (e.g., the package name of the project)
organization := "$Organization"

version := "1.0-SNAPSHOT"

// project description
description := "Treasure Data Project"

// Enables publishing to maven repo
publishMavenStyle := true

// Do not append Scala versions to the generated artifacts
crossPaths := false

// Do not add the Scala library to the dependencies (pure-Java project)
autoScalaLibrary := false

// library dependencies: (organization name) % (artifact name) % (version)
libraryDependencies ++= Seq(
   "org.apache.commons" % "commons-math3" % "3.1.1",
   "org.fluentd" % "fluent-logger" % "0.2.10",
   "org.mockito" % "mockito-core" % "1.9.5" % "test"  // Test-only dependency
)
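
For a pure-Java project it can also be useful to pin the Java source and target levels; a sketch of optional compiler settings (an assumption, not part of the original configuration):

    // Optional: target a specific Java version and enable lint warnings (compile scope only)
    javacOptions in (Compile, compile) ++= Seq("-source", "1.7", "-target", "1.7", "-Xlint:unchecked")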

### Packaging a Java Application with Maven

POM configuration

<project>
    <modelVersion>4.0.0</modelVersion>
    <!-- Information about your project -->
    <groupId>com.databricks</groupId>
    <artifactId>example-build</artifactId>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0</version>
    <dependencies>
        <!-- Spark dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.2.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- Third-party library -->
        <dependency>
            <groupId>net.sf.jopt-simple</groupId>
            <artifactId>jopt-simple</artifactId>
            <version>4.3</version>
        </dependency>
        <!-- Third-party library -->
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- Maven shade plug-in that creates uber JARs -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Packaging commands

$ mvn package
# In the target directory, we'll see an uber JAR and the original package JAR
$ ls target/
example-build-1.0.jar
original-example-build-1.0.jar
# Listing the uber JAR will reveal classes from dependency libraries
$ jar tf target/example-build-1.0.jar
...
joptsimple/HelpFormatter.class
...
org/joda/time/tz/UTCProvider.class
...
# An uber JAR can be passed directly to spark-submit
$ /path/to/spark/bin/spark-submit --master local ... target/example-build-1.0.jar

### Deploying and Submitting Applications

Spark applications are submitted with the spark-submit script, which can launch programs written in Python, Java, or Scala.

Submitting a Python application

bin/spark-submit my_script.py

Using spark-submit with various options
# Submitting a Java application to Standalone cluster mode
$ ./bin/spark-submit \
--master spark://hostname:7077 \
--deploy-mode cluster \
--class com.databricks.examples.SparkExample \
--name "Example Program" \
--jars dep1.jar,dep2.jar,dep3.jar \
--total-executor-cores 300 \
--executor-memory 10g \
myApp.jar "options" "to your application" "go here"
# Submitting a Python application in YARN client mode
$ export HADOOP_CONF_DIR=/opt/hadoop/conf
$ ./bin/spark-submit \
--master yarn \
--py-files somelib-1.2.egg,otherlib-4.4.zip,other-file.py \
--deploy-mode client \
--name "Example Program" \
--queue exampleQueue \
--num-executors 40 \
--executor-memory 10g \
my_script.py "options" "to your application" "go here"
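
For reference, here is a minimal Scala sketch of the kind of application class the first example points to with --class (the package and class name come from the command above; the body is illustrative, not from the original text):

    // Hypothetical contents of the class passed to --class in the example above
    package com.databricks.examples

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Example Program")
        val sc = new SparkContext(conf)

        // A trivial word count: read from args(0), write counts to args(1)
        val counts = sc.textFile(args(0))
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile(args(1))
        sc.stop()
      }
    }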