Spark Streaming with Flume: operational workflow
Copy the dependency jar into Flume's lib folder and edit the conf configuration file (three parts: source, channel, sink).
(1) Install Flume 1.6 or later.
(2) Download the dependency jar:
put spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory.
(3) Write the Flume agent. Note that since this is the pull (poll) mode, Flume only needs to produce data on the machine it runs on.
(4) Write the flume-poll.conf or flume-push.conf configuration file.
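Besides the sink jar placed in Flume's lib directory, the Spark application itself needs the spark-streaming-flume integration on its classpath. A minimal build.sbt sketch, assuming Scala 2.11 and Spark 2.0.2 (adjust to your cluster versions):

// build.sbt (sketch; versions are assumptions matching the sink jar above)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming-flume" % "2.0.2"
)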
1 Flume push mode (can only push data to one machine, which is quite limiting)
Write the flume-push.conf configuration file:
#push mode
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 172.16.43.63
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
Note: the hostname and port specified in the configuration file are the IP address and port of the server where the Spark application runs.
#First start the Spark Streaming application (a sketch follows the data-generation command below)
#Then start Flume:
flume-ng agent -n a1 -c /export/servers/flume/conf -f /export/servers/flume/conf/flume-push.conf -Dflume.root.logger=INFO,console
Command to generate data (written into the spoolDir configured above): while true; do echo hadoop hadoop spark >> /root/data/data.txt; sleep 2; done
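In push mode the Spark Streaming application acts as the Avro receiver that Flume's avro sink pushes events to, so it must be listening on the hostname/port above before the agent starts. Below is a minimal Scala sketch using FlumeUtils.createStream; the application name, batch interval and word-count logic are assumptions, while the bind address 172.16.43.63:8888 matches the sink configuration above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePushWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Receive events pushed by a1.sinks.k1 (avro sink) on this host/port
    val flumeStream = FlumeUtils.createStream(ssc, "172.16.43.63", 8888)
    // Each SparkFlumeEvent wraps an Avro event; the body holds the raw line
    val lines = flumeStream.map(e => new String(e.event.getBody.array()))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}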
2 Flume poll mode (multiple Flume agents can be configured)
Write the flume-poll.conf configuration file:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/data
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hdp-node-01
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
#First put the downloaded spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory (the same jar as in step (2) above)
#Start Flume:
bin/flume-ng agent -n a1 -c /export/servers/flume/conf -f /export/servers/flume/conf/flume-poll.conf -Dflume.root.logger=INFO,console
#Then start the Spark Streaming application (a sketch follows below)
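In poll mode the Spark Streaming application pulls events from the SparkSink running inside Flume, so Flume is started first and the application then connects to hdp-node-01:8888 as configured above. Below is a minimal Scala sketch using FlumeUtils.createPollingStream; the application name, batch interval and word-count logic are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull events from the SparkSink configured on hdp-node-01:8888
    val pollingStream = FlumeUtils.createPollingStream(ssc, "hdp-node-01", 8888)
    val lines = pollingStream.map(e => new String(e.event.getBody.array()))
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}

To pull from several Flume agents, the createPollingStream overload that takes a Seq[InetSocketAddress] can be used instead of a single hostname and port.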