如需转载,请标明原文出处以及作者

陈锐 RuiChen @kiwik

2019/06/04 17:02:19


背景

今天开始大数据领域的另一个核心项目Sparkaarch64上的编译和测试,与上一篇博客《在ARM64上编译Hadoop》 不同的是Spark的主要编程语言为Scala,遇到的问题也不尽相同。

编译环境

  • OS: CentOS 7.6
  • Arch: aarch64
  • Host: 华为云ARM公测实例
  • Spark: git commit 8b18ef5c7b670b154494b365a77b9bb015c1a10f (2019-06主干)
  • Java: 1.8.0
  • Maven: 3.6.1
  • Scala: 2.12.8
  • Zinc: 0.3.15

还是从Spark的官方编译文档 入手,我选择用Maven编译。Spark的代码仓库中提供了一个对于mvn的封装bash脚本,用于按照项目 的要求下载和配置特定版本的MavenZincScala,并通过./build/mvn执行编译命令,这引起 了第一个问题:

[root@arm-chenrui spark]# ./build/mvn --version
/root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/nailgun: line 50: /root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/ng/linux32/ng: cannot execute binary file
/root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/nailgun: line 50: /root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/ng/linux32/ng: cannot execute binary file
Using `mvn` from path: /usr/local/src/apache-maven/bin/mvn
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-05T03:00:29+08:00)
Maven home: /usr/local/src/apache-maven
Java version: 1.8.0_212, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.aarch64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.14.0-49.el7a.aarch64", arch: "aarch64", family: "unix"
/root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/nailgun: line 50: /root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin/ng/linux32/ng: cannot execute binary file

注意前两行和最后一行,Spark用到了Zinc对Scala和Java代码进行增量编译,提升编译速度, ./build/mvn脚本中启动了一个Zinc server,间接调用到了nailgun的二进制客户端代码, nailgun的客户端代码用C编写,服务端代码用Java。想了解Zinc和Nailgun,请读这里

如下就是Zinc中包含的调用nailgun的shell脚本,是通过uname命令获取的操作系统和架构信息, 我使用的CentOS Linux和aarch64,会匹配到最后一个分支*linux*, 导致将平台识别为linux32, 在ARM环境上执行一个x86的二进制文件失败。

[root@arm-chenrui spark]# uname -a | tr "[:upper:]" "[:lower:]"
linux arm-chenrui 4.14.0-49.el7a.aarch64 #1 smp tue apr 10 17:22:26 utc 2018 aarch64 aarch64 aarch64 gnu/linux
# zinc-0.3.15/bin/nailgun
if ! ng_exists ; then
  # choose bundled binary based on platform
  declare -r una=$( uname -a | tr "[:upper:]" "[:lower:]" )
  declare platform=""
  case "$una" in
    *cygwin*)        platform="win32" ;;
    *mingw32*)       platform="win32" ;;
    *darwin*x86_64*) platform="darwin64" ;;
    *darwin*)        platform="darwin32" ;;
    *linux*x86_64*)  platform="linux64" ;;
    *linux*ppc64le*) platform="linuxppc64le" ;;
    *linux*)         platform="linux32" ;;
    *)               platform="unknown" ;;
  esac
  cmd="$script_dir/ng/$platform/ng"
  [[ -x "$cmd" ]] && ng_cmd="$cmd"
fi

# run ng client command
"$ng_cmd" "$@"
[root@arm-chenrui bin]# pwd
/root/gopath/src/github.com/apache/spark/build/zinc-0.3.15/bin
[root@arm-chenrui bin]# tree
.
├── nailgun
├── ng
│   ├── darwin32
│   │   └── ng
│   ├── darwin64
│   │   └── ng
│   ├── linux32
│   │   └── ng
│   ├── linux64
│   │   └── ng
│   ├── linuxppc64le
│   │   └── ng
│   └── win32
│       └── ng.exe
└── zinc

7 directories, 8 files
[root@arm-chenrui bin]# file ng/linux32/ng 
ng/linux32/ng: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=6cbc2480f56773bb2df2689506473c806d6b387f, stripped

看了下Zinc已经很多年没有人维护了,可能都在使用 sbt了,Spark也提供了使用sbt编译出包的方法,先放下,看看 Maven有没有规避的方法。

仔细读了一下Spark的build/mvn脚本,结合Spark的pom.xml中的scala-maven-plugin 配置,发现Zinc server模式是一个可选功能,不影响正常的编译。而且Maven插件scala-maven-plugin 会自动解决依赖的Scala包,匹配正确的版本,考虑直接使用系统的mvn命令编译。

If there is no Zinc server currently running then the plugin falls back to regular incremental compilation.

注意:如果第一次用build/mvn编译,会在后台启动一个Zinc的Java进程,需要把进程Kill掉。

开始编译

使用Maven编译,仓库使用国内镜像仓库,提升速度,配置参考我的上一篇博客《在ARM64上编译Hadoop》。

编译命令mvn -DskipTests clean package,在默认参数下编译成功,整体大约15分钟。

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [ 29.829 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 23.509 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  4.940 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  5.628 s]
[INFO] Spark Project Networking ........................... SUCCESS [  9.637 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  5.218 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.465 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 12.940 s]
[INFO] Spark Project Core ................................. SUCCESS [03:17 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 10.288 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 13.799 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 38.022 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:13 min]
[INFO] Spark Project SQL .................................. SUCCESS [03:59 min]
[INFO] Spark Project ML Library ........................... SUCCESS [02:17 min]
[INFO] Spark Project Tools ................................ SUCCESS [  1.539 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 56.147 s]
[INFO] Spark Project REPL ................................. SUCCESS [  5.218 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  4.633 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [  4.052 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 11.616 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 13.700 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 19.606 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  6.458 s]
[INFO] Spark Avro ......................................... SUCCESS [  7.448 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  16:41 min
[INFO] Finished at: 2019-05-28T11:54:16+08:00
[INFO] ------------------------------------------------------------------------

增加一些参数:mvn -Phadoop-3.2 -Pyarn -Phive -Pmesos -Pkubernetes -DskipTests clean package

单独编译子模块:mvn -Phive -Phive-thriftserver -pl :spark-hive-thriftserver_2.12 -DskipTests clean package

这两种情况下都可以编译成功,可以看到Spark在aarch64环境下的编译是没有问题的。从github上 看Spark的主要编程语言是ScalaJavaPython,没有涉及到CC++Golang的等平台相关 的编程语言。

执行测试

由于OpenLab计划向各个开源社区提供ARM CI环境,我们也会初步验证一下主流软件在ARM环境上的 测试流程。

PySpark Test

首先尝试执行PySpark的测试,需要安装numpy包,CentOS已经在仓库中提供了对应的ARM版本,需要 执行命令yum install python2-numpy,按照Spark官方文档,执行测试:

mvn -DskipTests clean package -Phive
./python/run-tests

直接执行遇到如下错误,跟snappy库相关,Spark用到了snappy-java,它对于aarch64有直接的 支持,但是它对于本地的glibc库版本有要求,错误看起来是CentOS 7.6配套的libstdc++.so 库版本过低,通过命令strings /lib64/libstdc++.so.6 | grep GLIBCXX_3.4确认了一下, CentOS 7.6默认支持gcc 4.8.5GLIBCXX_3.4.19,而snappy-java要求的版本是GLIBCXX_3.4.21。 据说CentOS在生命周期内是不会升级libstdc++.so的版本的,而且我不想在CentOS上再编译gcc, 手动替换libstdc++.so的库文件,这样做可能会引起其他的问题,所以这个问题规避不了。

个人认为,这个问题是CentOS的问题,不是ARM的问题。后续尝试使用Ubuntu 16.04测试。

Caused by: java.lang.UnsatisfiedLinkError: /root/gopath/src/github.com/apache/spark/python/target/9b3ceca5-eed5-4a07-920f-ea1ca2799335/snappy-1.1.7-6c9d7355-f005-4535-9c3f-1065c35c8348-libsnappyjava.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /root/gopath/src/github.com/apache/spark/python/target/9b3ceca5-eed5-4a07-920f-ea1ca2799335/snappy-1.1.7-6c9d7355-f005-4535-9c3f-1065c35c8348-libsnappyjava.so)
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
        at java.lang.Runtime.load0(Runtime.java:809)
        at java.lang.System.load(System.java:1086)
        at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:179)
        at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)
        at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
        at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
        at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
        at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
        at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.compress(CodecFactory.java:165)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:95)
        at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
        at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
        at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:75)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:275)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1384)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:270)
        ... 9 more

Java Test

直接在Spark源码目录下执行mvn test,遇到的错误如下,也是一个JNI加载本地库文件的问题。

[INFO] Running org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] Tests run: 38, Failures: 0, Errors: 38, Skipped: 0, Time elapsed: 0.261 s <<< FAILURE! - in org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] copyIndexDescendingWithStart(org.apache.spark.util.kvstore.LevelDBIteratorSuite)  Time elapsed: 0.229 s  <<< ERROR!
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /root/gopath/src/github.com/apache/spark/common/kvstore/target/tmp/libleveldbjni-64-1-8993977504386930128.8: /root/gopath/src/github.com/apache/spark/common/kvstore/target/tmp/libleveldbjni-64-1-8993977504386930128.8: cannot open shared object file: No such file or directory (Possible cause: can't load AMD 64-bit .so on a AARCH64-bit platform)]
        at org.apache.spark.util.kvstore.LevelDBIteratorSuite.createStore(LevelDBIteratorSuite.java:44)

leveldbjni-all-1.8.jar解压了之后发现,jar包里没有包含对应的aarch64版本的so文件。 查看了一下leveldbjni的源码仓库,最新 主干已经支持了aarch64,但是没做1.9版本的发布,所以在Maven中央仓库的最新版本是1.8, 不包括对于aarch64的支持。很不幸,项目已经两年多无人维护了。

[root@arm-chenrui test]# tree META-INF/native/
META-INF/native/
├── linux32
│   └── libleveldbjni.so
├── linux64
│   └── libleveldbjni.so
├── osx
│   └── libleveldbjni.jnilib
├── windows32
│   └── leveldbjni.dll
└── windows64
    └── leveldbjni.dll

5 directories, 5 files

问题

Spark的这个问题不同于上一篇博客《在ARM64上编译Hadoop》 中遇到的编译问题,这是一个典型的运行时问题。Spark在aarch64上编译通过,但是在运行Python和 Java测试时,加载依赖库出现了问题,这种问题如果不进行全面的自动化CI测试不容易发现。因此后续 OpenLab在向开源项目提供ARM CI时,需要覆盖:编译、UT和 集成测试等多个方面。这样才能有效的促进ARM的服务器端软件生态,让客户的业务系统在ARM服务器 上运行的更为顺畅,也对ARM更有信心。

OpenLab ARM CI 还有很多工作要做。



Published

04 June 2019

Category

ARM

Tags