Setting Up a Hadoop Cluster with Docker

I've recently been trying to set up a Hadoop cluster, and after some back and forth I decided to build it with Docker: it uses fewer resources, and it's quick and convenient. This post gets a Hadoop cluster up and running with a single command.

  • The commands in this post require Docker to be installed on your machine.
  • Docker version used in this post: docker-ce:18.09.0
  • The scripts and related files for this post are available at: GitHub/CodingFanlt/Scrip

Running

Let's look at the end result first. Run the script below on a machine with Docker installed and it will automatically build and start a Hadoop cluster.

You can download the script with either curl or wget.

Without further ado, one command does it all:

Using curl:

curl -O https://raw.githubusercontent.com/CodingFanlt/Scrip/master/docker/hadoop/hadoop-2.9.2/start.sh && chmod +x start.sh && ./start.sh

Using wget:

wget https://raw.githubusercontent.com/CodingFanlt/Scrip/master/docker/hadoop/hadoop-2.9.2/start.sh && chmod +x start.sh && ./start.sh

Once the command finishes, you will have one master node and two slave nodes. The first run will be slow because the required Docker image is not available locally and has to be pulled over the network.

The command also leaves a start.sh script in the directory where you ran it; the next time you need the Hadoop cluster, just run that script again.

You can specify the number of slave nodes (up to five) as follows.

Specify the number of slave nodes:

./start.sh 5

Note: if your user is not in the docker group, you need to run the script with sudo.

After the nodes are created you will be dropped into the master node automatically; you can also enter the master or a slave node manually with the following command.

Enter a Docker container:

docker exec -it your_node_name bash
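
If you are not sure of the exact node names, you can list the containers the start script created; a minimal check (the name filter is just an assumption based on the naming convention used in start.sh):

docker ps --format "{{.Names}}" --filter "name=hadoop"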

Inside the master node, start the cluster with the following command.

Start the cluster:

./start-hadoop.sh

Next, you can run the wordcount example with the following command.

Run wordcount:

./run-wordcount.sh

You can open your_ip:8088 and your_ip:50070 in a browser to check the cluster status (the YARN and HDFS web UIs respectively).
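
If you prefer the command line, a quick reachability check from the host (localhost works here because start.sh publishes both ports on the machine running Docker):

# each should print an HTTP status line once the cluster is up
curl -sI http://localhost:50070 | head -n 1
curl -sI http://localhost:8088 | head -n 1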

Shutting Down the Cluster

First leave the container with the exit command, then run the command below to shut the cluster down.

Using curl:

curl -O https://raw.githubusercontent.com/CodingFanlt/Scrip/master/docker/hadoop/hadoop-2.9.2/stop.sh && chmod +x stop.sh && ./stop.sh

Using wget:

wget https://raw.githubusercontent.com/CodingFanlt/Scrip/master/docker/hadoop/hadoop-2.9.2/stop.sh && chmod +x stop.sh && ./stop.sh

Building the Image

  • Note: the Dockerfile and configuration files shown in this section may differ slightly from the versions on GitHub; when in doubt, refer to the versions on GitHub.

  • Software versions used in this post:

    • Base image: Ubuntu:16.04
    • JDK: openjdk-8-jdk
    • Hadoop: hadoop-2.9.2

If you need to, you can change the configuration files and rebuild the image.

Download the configuration files:

git clone https://github.com/CodingFanlt/Scrip.git

Rebuild the image:

cd Scrip/docker/hadoop/hadoop-2.9.2/
docker build -t your_image_name .

After rebuilding, the codingfanlt/hadoop:2.9.2 reference in the start script also needs to be changed to the name of your new image, for example as shown below.
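
A minimal sketch of swapping the image name in the downloaded start.sh (your_image_name is whatever tag you passed to docker build):

sed -i 's#codingfanlt/hadoop:2.9.2#your_image_name#g' start.sh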

Dockerfile

To make package installation easier, I use an image I built earlier as the base image (it only switches the apt sources to the USTC mirror in China); the Hadoop version is 2.9.2.

Let's start with the most important file, the Dockerfile:

FROM codingfanlt/ubuntu-ustc-source:16.04

MAINTAINER CodingFanlt <CodingFanlt@Gmail.com>

WORKDIR /root

# install openssh-server, openjdk and wget
RUN apt-get update && apt-get upgrade -y \
&& apt-get install -y openssh-server openjdk-8-jdk wget vim net-tools iputils-ping

# install hadoop 2.9.2
RUN wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz \
&& tar -xzvf hadoop-2.9.2.tar.gz \
&& mv hadoop-2.9.2 /usr/local/hadoop \
&& rm hadoop-2.9.2.tar.gz

# set environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

# ssh without key
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' \
&& cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

RUN mkdir -p ~/hdfs/namenode \
&& mkdir -p ~/hdfs/datanode \
&& mkdir $HADOOP_HOME/logs

# copy file
COPY config/* /tmp/

RUN mv /tmp/ssh_config ~/.ssh/config \
&& mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh \
&& mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml \
&& mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml \
&& mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml \
&& mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml \
&& mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves \
&& mv /tmp/start-hadoop.sh ~/start-hadoop.sh \
&& mv /tmp/run-wordcount.sh ~/run-wordcount.sh

# add executable permissions
RUN chmod +x ~/start-hadoop.sh \
&& chmod +x ~/run-wordcount.sh \
&& chmod +x $HADOOP_HOME/sbin/start-dfs.sh \
&& chmod +x $HADOOP_HOME/sbin/start-yarn.sh

# format namenode
RUN /usr/local/hadoop/bin/hdfs namenode -format

CMD [ "sh", "-c", "service ssh start; bash"]

The build file installs the necessary packages, sets the relevant environment variables, and copies the required scripts and configuration files into place.
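
A quick way to sanity-check a freshly built image is to run a throwaway container and print the Hadoop and Java versions; a minimal sketch, assuming the image tag used in this post:

docker run --rm codingfanlt/hadoop:2.9.2 hadoop version
docker run --rm codingfanlt/hadoop:2.9.2 java -version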

start.sh

Next is the startup script, start.sh:

#!/bin/bash

# the default node number is 2
N=${1:-2} # use 2 when the first argument is unset or empty

# create hadoop network
docker network create --driver=bridge hadoop

# start hadoop master container
docker rm -f hadoop-master &> /dev/null
echo "start hadoop-master container..."
docker run -itd \
    --net=hadoop \
    -p 50070:50070 \
    -p 8088:8088 \
    --name hadoop-master \
    --hostname hadoop-master \
    codingfanlt/hadoop:2.9.2 &> /dev/null

# start hadoop slave containers
i=1
# while [ $i -lt $N ]
while (( "$i" <= "$N" ))
do
    docker rm -f hadoop-slave$i &> /dev/null
    echo "start hadoop-slave$i container..."
    docker run -itd \
        --net=hadoop \
        --name hadoop-slave$i \
        --hostname hadoop-slave$i \
        codingfanlt/hadoop:2.9.2 &> /dev/null
    i=$(( $i + 1 ))
done

# get into hadoop master container
docker exec -it hadoop-master bash

The script itself is quite simple: it first defines a default number of slave nodes (overridden by the first argument if one is given), then creates a dedicated bridge network just for the Hadoop cluster. After that it removes any containers left over from a previous run (discarding the console output), creates the new containers, and finally drops you into the master node.
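
If you want to see what the script actually created, two quick checks from the host (the network and container names match those used in the script above):

docker network inspect hadoop
docker ps --filter "network=hadoop" --format "table {{.Names}}\t{{.Ports}}"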

start-hadoop.sh

Next, let's look at the start-hadoop.sh script inside the master node:

#!/bin/bash

echo -e "\n"

$HADOOP_HOME/sbin/start-dfs.sh

echo -e "\n"

$HADOOP_HOME/sbin/start-yarn.sh

echo -e "\n"

This script is even simpler: it starts HDFS, then starts YARN. It lives inside the container.
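
After it finishes you can verify that the daemons are up; a minimal check run inside hadoop-master (the slave name assumes the default two-node setup):

# the master should show NameNode, SecondaryNameNode and ResourceManager
$JAVA_HOME/bin/jps
# a slave should show DataNode and NodeManager; $JAVA_HOME expands on the
# master, and the path is the same in every container built from this image
ssh hadoop-slave1 "$JAVA_HOME/bin/jps"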

run-wordcount.sh

And here is run-wordcount.sh:

#!/bin/bash

# test the hadoop cluster by running wordcount

# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt

# create input directory on HDFS
hadoop fs -mkdir -p input

# put input files to HDFS
hadoop fs -put ./input/* input

# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.9.2-sources.jar org.apache.hadoop.examples.WordCount input output

# print the input files
echo -e "\ninput file1.txt:"
hadoop fs -cat input/file1.txt

echo -e "\ninput file2.txt:"
hadoop fs -cat input/file2.txt

# print the output of wordcount
echo -e "\nwordcount output:"
hadoop fs -cat output/part-r-00000

This script creates two small input files, puts them into HDFS, runs the WordCount example that ships with Hadoop, and then reads the result back and prints it. It lives inside the container.
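
Note that a second run will fail as-is, because the input files already exist in HDFS and MapReduce refuses to write to an existing output directory. A minimal cleanup before re-running (the paths match those used in the script):

# run inside hadoop-master
hadoop fs -rm -r -f input output
rm -rf ./input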

Hadoop Configuration Files

The remaining files are the Hadoop configuration files. I won't go into too much detail here and will simply list them.

core-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- define the default file system host and port -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000/</value>
    </property>
</configuration>

hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Enable extra debugging of Hadoop's JAAS binding, used to set up
# Kerberos security.
# export HADOOP_JAAS_DEBUG=true

# Extra Java runtime options. Empty by default.
# For Kerberos debugging, an extended option set logs more invormation
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS"
# set heap args when HADOOP_HEAPSIZE is empty
if [ "$HADOOP_HEAPSIZE" = "" ]; then
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
fi
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
#export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Router-based HDFS Federation specific parameters
# Specify the JVM options to be used when starting the RBF Routers.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_DFSROUTER_OPTS=""
###

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

This file needs a bit of attention: in the original, the JAVA_HOME export (around line 25 of the file) did not use an absolute path, which caused Hadoop to fail to find the JDK. Be sure to use an absolute path here; it varies with the JDK version and the CPU architecture.
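
If you are unsure what the correct absolute path is inside your container, one way to discover it (assuming openjdk-8-jdk installed from apt, as in the Dockerfile above):

# resolve the real location of javac and strip the trailing /bin/javac
readlink -f "$(which javac)" | sed 's:/bin/javac$::'
# on this Ubuntu 16.04 image it should print /usr/lib/jvm/java-8-openjdk-amd64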

hdfs-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- namenode storage path: stores namespace and transaction log info -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///root/hdfs/namenode</value>
        <description>NameNode directory for namespace and transaction logs storage.</description>
    </property>
    <!-- datanode storage path: stores block data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///root/hdfs/datanode</value>
        <description>DataNode directory</description>
    </property>
    <!-- number of replicas: default is 3, reset to 2 here -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>

This file sets the replication factor and the data storage paths for the HDFS nodes.
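
You can confirm that the setting is picked up from inside the master; a quick check (hdfs getconf ships with Hadoop 2.x):

hdfs getconf -confKey dfs.replication
# for files already in HDFS, the replication factor appears in the second column
hadoop fs -ls input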

mapred-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- specify the framework to run MapReduce jobs on -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

This tells Hadoop which MapReduce framework to use, in this case YARN.

slaves

hadoop-slave1
hadoop-slave2
hadoop-slave3
hadoop-slave4
hadoop-slave5

This file lists the hostnames of the slave nodes. Since this is only a simulated environment rather than a real deployment, I set it to five slave nodes; you can add more if needed, for example with the sketch below.
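
A small hypothetical helper to regenerate this file for a different number of slaves before rebuilding the image (the hostname pattern matches the one used in start.sh); the five-node limit mentioned earlier presumably comes from this file only listing five hosts:

#!/bin/bash
# regenerate the slaves file for N workers; N=7 is just an example
N=7
: > slaves
for i in $(seq 1 "$N"); do
    echo "hadoop-slave$i" >> slaves
done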

ssh_config

Host localhost
    # automatically accept the public key the first time you connect
    StrictHostKeyChecking no

Host 0.0.0.0
    # automatically accept the public key the first time you connect
    StrictHostKeyChecking no

Host hadoop-*
    # automatically accept the public key the first time you connect
    StrictHostKeyChecking no
    # the location where known_hosts is stored. Generally you don't need to configure this separately; when StrictHostKeyChecking is set to 'no', it can point to /dev/null.
    UserKnownHostsFile=/dev/null

ssh normally asks you to confirm a host the first time you log in to it (only after confirmation is that host's fingerprint written to the local ~/.ssh/known_hosts file). These settings accept the key automatically, so the Hadoop scripts are never blocked by the prompt.
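
A quick way to confirm this from inside hadoop-master (the slave name assumes the default setup): the command below should print the remote hostname immediately, with no password prompt and no host-key confirmation.

ssh hadoop-slave1 hostname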

yarn-site.xml

<?xml version="1.0"?>
<configuration>
    <!-- Auxiliary services running on the NodeManager. "mapreduce_shuffle" is required to run MapReduce programs. -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- The class backing the auxiliary service on the NodeManager. -->
    <!-- <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property> -->
    <!-- point NodeManagers at the ResourceManager on the master node -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
    </property>
</configuration>

YARN-related configuration.
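
Once the cluster is running you can check that every NodeManager has registered with the ResourceManager; a minimal check from inside hadoop-master:

# should list one RUNNING node per slave container
yarn node -list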

I've been rather busy lately and haven't had time to write this up in more detail; if you have questions, please leave a comment.

Related Docker Commands

Stop all Hadoop-related containers:

docker stop $(docker ps | grep "hadoop*" | awk '{print $1}')

Remove all Hadoop-related containers:

docker rm $(docker container ls -a | grep "hadoop*" | awk '{print $1}')

Remove all Hadoop-related images:

docker image rm $(docker image ls | grep "codingfanlt/hadoop*" | awk '{print $1}')
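
The grep patterns above match on container and image names; an alternative sketch that relies on Docker's own filters instead, and also removes the hadoop network created by start.sh:

docker rm -f $(docker ps -aq --filter "name=hadoop")
docker network rm hadoop
docker image rm codingfanlt/hadoop:2.9.2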

References

sequenceiq/hadoop-docker
kiwenlau/hadoop-cluster-docker