Docker+Hadoop+Hive+Presto 使用Docker部署Hadoop环境和Presto

 

Background

一. 什么是Presto

Presto通过使用分布式查询,可以快速高效的完成海量数据的查询。如果你需要处理TB或者PB级别的数据,那么你可能更希望借助于Hadoop和HDFS来完成这些数据的处理。作为Hive和Pig(Hive和Pig都是通过MapReduce的管道流来完成HDFS数据的查询)的替代者,Presto不仅可以访问HDFS,也可以操作不同的数据源,包括:RDBMS和其他的数据源(例如:Cassandra)。

Presto被设计为数据仓库和数据分析产品:数据分析、大规模数据聚集和生成报表。这些工作经常通常被认为是线上分析处理操作。

Presto是FaceBook开源的一个开源项目。Presto在FaceBook诞生,并且由FaceBook内部工程师和开源社区的工程师公共维护和改进。

 

二. 环境和应用准备

  • 环境

  macbook pro

  • application

  Docker for mac: https://docs.docker.com/docker-for-mac/#check-versions

  jdk-1.8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

  hadoop-2.7.5

  hive-2.3.3

  presto-server-0.198.tar.gz

  presto-cli-0.198-executable.jar

 

三. 构建images

我们使用Docker来启动三台Centos7虚拟机,三台机器上安装Hadoop和Java。

1. 安装Docker,Macbook上安装Docker,并使用仓库账号登录。

docker login

2. 验证安装结果

docker version

3. 拉取Centos7 images

docker pull centos

4. 构建具有ssh功能的centos

mkdir ~/centos-ssh
cd centos-ssh
vi Dockerfile
# 选择一个已有的os镜像作为基础  
FROM centos 

# 镜像的作者  
MAINTAINER crxy 

# 安装openssh-server和sudo软件包,并且将sshd的UsePAM参数设置成no  
RUN yum install -y openssh-server sudo  
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config  
#安装openssh-clients
RUN yum  install -y openssh-clients

# 添加测试用户root,密码root,并且将此用户添加到sudoers里  
RUN echo "root:root" | chpasswd  
RUN echo "root   ALL=(ALL)       ALL" >> /etc/sudoers  
# 下面这两句比较特殊,在centos6上必须要有,否则创建出来的容器sshd不能登录  
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key  
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key  

# 启动sshd服务并且暴露22端口  
RUN mkdir /var/run/sshd  
EXPOSE 22  
CMD ["/usr/sbin/sshd", "-D"]

构建

docker build -t=”centos-ssh” .

5. 基于centos-ssh镜像构建有JDK和Hadoop的镜像

mkdir ~/hadoop
cd hadoop
vi Dockerfile
FROM centos-ssh
ADD jdk-8u161-linux-x64.tar.gz /usr/local/
RUN mv jdk-8u161-linux-x64.tar.gz /usr/local/jdk1.7
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH

ADD hadoop-2.7.5.tar.gz /usr/local
RUN mv hadoop-2.7.5.tar.gz /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

jdk包和hadoop包要放在hadoop目录下

docker build -t=”centos-hadoop” .

 

四. 搭建Hadoop集群

1. 集群规划

搭建有三个节点的hadoop集群,一主两从

主节点:hadoop0 ip:172.18.0.2 
从节点1:hadoop1 ip:172.18.0.3 
从节点2:hadoop2 ip:172.18.0.4

但是由于docker容器重新启动之后ip会发生变化,所以需要我们给docker设置固定ip。

Docker安装后,默认会创建下面三种网络类型:

docker network ls                                                                                                                                                                                                                           jinhongliu@Jinhongs-MacBo
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
569f368d1561        none                null                local

启动 Docker的时候,用 --network 参数,可以指定网络类型,如:

~ docker run -itd --name test1 --network bridge --ip 172.17.0.10 centos:latest /bin/bash

bridge:桥接网络

默认情况下启动的Docker容器,都是使用 bridge,Docker安装时创建的桥接网络,每次Docker容器重启时,会按照顺序获取对应的IP地址,这个就导致重启下,Docker的IP地址就变了.

none:无指定网络

使用 --network=none ,docker 容器就不会分配局域网的IP

host: 主机网络

使用 --network=host,此时,Docker 容器的网络会附属在主机上,两者是互通的。

例如,在容器中运行一个Web服务,监听8080端口,则主机的8080端口就会自动映射到容器中。

创建自定义网络:(设置固定IP)

启动Docker容器的时候,使用默认的网络是不支持指派固定IP的,如下:

~ docker run -itd --net bridge --ip 172.17.0.10 centos:latest /bin/bash
6eb1f228cf308d1c60db30093c126acbfd0cb21d76cb448c678bab0f1a7c0df6
docker: Error response from daemon: User specified IP address is supported on user defined networks only.

因此,需要创建自定义网络,下面是具体的步骤:

步骤1: 创建自定义网络

创建自定义网络,并且指定网段:172.18.0.0/16

➜ ~ docker network create --subnet=172.18.0.0/16 mynetwork
➜ ~ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
620ebbc09400        mynetwork           bridge              local
569f368d1561        none                null                local

步骤2: 创建docker容器。启动三个容器,分别作为hadoop0 hadoop1 hadoop2

➜  ~ docker run --name hadoop0 --hostname hadoop0 --net mynetwork --ip 172.18.0.2 -d -P -p 50070:50070 -p 8088:8088  centos-hadoop
➜  ~ docker run --name hadoop0 --hostname hadoop1 --net mynetwork --ip 172.18.0.3 -d -P centos-hadoop
➜  ~ docker run --name hadoop0 --hostname hadoop2 --net mynetwork --ip 172.18.0.4 -d -P centos-hadoop

使用docker ps 查看刚才启动的是三个容器:

5e0028ed6da0        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 3 hours          0.0.0.0:32771->22/tcp                                                     hadoop2
35211872eb20        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 4 hours          0.0.0.0:32769->22/tcp                                                     hadoop1
0f63a870ef2b        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 5 hours          0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32768->22/tcp   hadoop0

这样3台机器就有了固定的IP地址。验证一下,分别ping三个ip,能ping通就说明没问题。

 

五. 配置Hadoop集群

1. 先连接到hadoop0上, 使用命令

docker exec -it hadoop0 /bin/bash

下面的步骤就是hadoop集群的配置过程 
1:设置主机名与ip的映射,修改三台容器:vi /etc/hosts 
添加下面配置

172.18.0.2    hadoop0
172.18.0.3    hadoop1
172.18.0.4    hadoop2

2:设置ssh免密码登录 
在hadoop0上执行下面操作

cd  ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa(一直按回车即可)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2
在hadoop1上执行下面操作
cd  ~
cd .ssh
ssh-keygen -t rsa(一直按回车即可)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1
在hadoop2上执行下面操作
cd  ~
cd .ssh
ssh-keygen -t rsa(一直按回车即可)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2

3:在hadoop0上修改hadoop的配置文件 
进入到/usr/local/hadoop/etc/hadoop目录 
修改目录下的配置文件core-site.xml、hdfs-site.xml、yarn-site.xml、mapred-site.xml 
(1)hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.8

(2)core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop0:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/usr/local/hadoop/tmp</value>
        </property>
         <property>
                 <name>fs.trash.interval</name>
                 <value>1440</value>
        </property>
</configuration>

(3)hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

(4)yarn-site.xml

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property> 
                <name>yarn.log-aggregation-enable</name> 
                <value>true</value> 
        </property>
</configuration>

(5)修改文件名:mv mapred-site.xml.template mapred-site.xml 
vi mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

(6)格式化 
进入到/usr/local/hadoop目录下 
执行格式化命令

bin/hdfs namenode -format
注意:在执行的时候会报错,是因为缺少which命令,安装即可

执行下面命令安装
yum install -y which

格式化操作不能重复执行。如果一定要重复格式化,带参数-force即可。

(7)启动伪分布hadoop

命令:sbin/start-all.sh

第一次启动的过程中需要输入yes确认一下。 使用jps,检查进程是否正常启动?能看到下面几个进程表示伪分布启动成功

3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager

(8)停止伪分布hadoop

命令:sbin/stop-all.sh

(9)指定nodemanager的地址,修改文件yarn-site.xml

<property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop0</value>
  </property>

(10)修改hadoop0中hadoop的一个配置文件etc/hadoop/slaves 
删除原来的所有内容,修改为如下

hadoop1
hadoop2

(11)在hadoop0中执行命令

scp  -rq /usr/local/hadoop   hadoop1:/usr/local
scp  -rq /usr/local/hadoop   hadoop2:/usr/local

(12)启动hadoop分布式集群服务

执行sbin/start-all.sh

注意:在执行的时候会报错,是因为两个从节点缺少which命令,安装即可

分别在两个从节点执行下面命令安装

yum install -y which

再启动集群(如果集群已启动,需要先停止)

(13)验证集群是否正常 
首先查看进程: 

Hadoop0上需要有这几个进程

4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager

Hadoop1上需要有这几个进程

715 NodeManager
849 Jps
645 DataNode

Hadoop2上需要有这几个进程

456 NodeManager
589 Jps
388 DataNode

使用程序验证集群服务 
创建一个本地文件

vi a.txt
hello you
hello me

上传a.txt到hdfs上

hdfs dfs -put a.txt /

执行wordcount程序

cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /a.txt /out

查看程序执行结果 

这样就说明集群正常了。

通过浏览器访问集群的服务 
由于在启动hadoop0这个容器的时候把50070和8088映射到宿主机的对应端口上了

所以在这可以直接通过宿主机访问容器中hadoop集群的服务 

 

六. 安装Hive

我们使用Presto的hive connector来对hive中的数据进行查询,因此需要先安装hive.

1. 本地下载hive,使用下面的命令传到hadoop0上

docker cp ~/Download/hive-2.3.3-bin.tar.gz 容器ID:/

2. 解压到指定目录

tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /hive
cd /hive

3、配置/etc/profile,在/etc/profile中添加如下语句

export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile 

4、安装MySQL数据库

我们使用docker容器来进行安装,首先pull mysql image

docker pull mysql

启动mysql容器

docker run --name mysql -e MYSQL_ROOT_PASSWORD=111111 --net mynetwork --ip 172.18.0.5  -d

登录mysql容器

5、创建metastore数据库并为其授权

create database metastore;

6、 下载jdbc connector

下载地址Connector/J 5.1.43

下载完成之后将其解压,并把其中的mysql-connector-java-5.1.41-bin.jar文件拷贝到$HIVE_HOME/lib目录

7、修改hive配置文件

cd /hive/conf

7.1复制初始化文件并重改名

cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

7.2修改hive-env.sh

export JAVA_HOME=/usr/local/jdk1.8    ##Java路径
export HADOOP_HOME=/usr/local/hadoop   ##Hadoop安装路径
export HIVE_HOME=/usr/local/hive    ##Hive安装路径
export HIVE_CONF_DIR=/hive/conf    ##Hive配置文件路径

7.3在hdfs 中创建下面的目录 ,并且授权

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log

7.4修改hive-site.xml

<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
</property>

## 配置 MySQL 数据库连接信息
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://172.18.0.5:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>111111</value>
  </property>

7.5 创建tmp文件

mkdir /home/hadoop/hive/tmp

并在hive-site.xml中修改:

把{system:java.io.tmpdir} 改成 /home/hadoop/hive/tmp/

把 {system:user.name} 改成 {user.name}

8、初始化hive

schematool -dbType mysql -initSchema

9、启动hive

hive

10. hive中创建表

新建create_table文件

REATE TABLE IF NOT EXISTS `default`.`d_abstract_event` ( `id` BIGINT, `network_id` BIGINT, `name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:49:25' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_bumper` ( `front_bumper_id` BIGINT, `end_bumper_id` BIGINT, `content_item_type` STRING, `content_item_id` BIGINT, `content_item_name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:05' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tracking` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `creative_id` BIGINT, `creative_name` STRING, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `placement_id` BIGINT, `placement_name` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_status` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `agency_id` BIGINT, `agency_name` STRING, `status` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_frequency_cap` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `frequency_cap` INT, `frequency_period` INT, `frequency_cap_type` STRING, `frequency_cap_scope` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_skippable` ( `id` BIGINT, `skippable` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `internal_id` STRING, `staging_internal_id` STRING, `budget_exempt` INT, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `ad_unit_type` STRING, `ad_unit_size` STRING, `placement_id` BIGINT, `placement_name` STRING, `placement_internal_id` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `io_internal_id` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_internal_id` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `advertiser_internal_id` STRING, `agency_id` BIGINT, `agency_name` STRING, `agency_internal_id` STRING, `price_model` STRING, `price_type` STRING, `ad_unit_price` DECIMAL(16,2), `status` STRING, `companion_ad_package_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_staging` ( `ad_tree_node_id` BIGINT, `adapter_status` STRING, `primary_ad_tree_node_id` BIGINT, `production_ad_tree_node_id` BIGINT, `hide` INT, `ignore` INT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_trait` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `trait_type` STRING, `parameter` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit_ad_slot_assignment` ( `id` BIGINT, `ad_unit_id` BIGINT, `ad_slot_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit` ( `id` BIGINT, `name` STRING, `ad_unit_type` STRING, `height` INT, `width` INT, `size` STRING, `network_id` BIGINT, `created_type` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_advertiser` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `agency_id` BIGINT, `agency_name` STRING, `advertiser_company_id` BIGINT, `agency_company_id` BIGINT, `billing_contact_company_id` BIGINT, `address_1` STRING, `address_2` STRING, `address_3` STRING, `city` STRING, `state_region_id` BIGINT, `country_id` BIGINT, `postal_code` STRING, `email` STRING, `phone` STRING, `fax` STRING, `url` STRING, `notes` STRING, `billing_term` STRING, `meta_data` STRING, `internal_id` STRING, `active` INT, `budgeted_imp` BIGINT, `num_of_campaigns` BIGINT, `adv_category_name_list` STRING, `adv_category_id_name_list` STRING, `updated_at` TIMESTAMP, `created_at` TIMESTAMP) COMMENT 'Imported by sqoop on 2017/06/27 09:31:22' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
cat create_table | hive

11. 启动metadata service

presto需要使用hive的metadata service

nohup hive --service metadata &

至此hive的安装就完成了。

 

七. 安装presto

1. 下载presto-server-0.198.tar.gz

2. 解压

cd presto-service-0.198
mkdir etc
cd etc

3. 编辑配置文件:

Node Properties 

etc/node.properties

node.environment=production
node.id=ffffffff-0000-0000-0000-ffffffffffff
node.data-dir=/opt/presto/data/discovery/

JVM Config

etc/jvm.config

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Config Properties

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hadoop0:8080

catalog配置:

etc/catalog/hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hadoop0:9083
hive.config.resources=/usr/local/hadoop/etc/hadoop/core-site.xml,/usr/local/hadoop/etc/hadoop/hdfs-site.xml

4. 启动hive service

./bin/launch start

5. Download presto-cli-0.198-executable.jar, rename it to presto, make it executable with chmod +x, then run it:

./presto --server localhost:8080 --catalog hive --schema default

这样整个配置就完成啦。看一下效果吧,通过show tables来查看我们在hive中创建的表。

 

参考:

https://blog.csdn.net/xu470438000/article/details/50512442‘

http://www.jb51.net/article/118396.htm

https://prestodb.io/docs/current/installation/cli.html

 

posted @ 2018-04-11 15:08  Black_Knight  阅读(7236)  评论(0编辑  收藏  举报