Data processing with mrjob
Example
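"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

# Match runs of word characters and apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the line; the input key is unused.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        # Pre-sum the counts after the map step to shrink the shuffled data.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Sum the partial counts into the final frequency for each word.
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()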
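In summary, a job is a pipeline that processes (key, value) pairs: the mapper emits (word, 1) pairs, the combiner pre-aggregates them after the map step, and the reducer sums the values for each key.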
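To make the (key, value) flow concrete, here is a minimal pure-Python sketch that imitates the three phases without mrjob or Hadoop. The names `group_by_key`, `sum_counts`, and `word_freq`, and the idea of passing the input as a list of "splits", are assumptions made for illustration only; they are not part of mrjob.

```python
from collections import defaultdict
import re

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Map phase: emit (word, 1) for every word on the line.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_counts(word, counts):
    # Shared by the combine and reduce phases: both just sum counts.
    return word, sum(counts)


def group_by_key(pairs):
    # Stand-in for the shuffle/sort step: collect all values per key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def word_freq(splits):
    # `splits` is a list of line lists, one per simulated mapper task.
    combined = []
    for lines in splits:
        mapped = (pair for line in lines for pair in mapper(line))
        # Combine phase: pre-aggregate within a single split.
        for word, counts in group_by_key(mapped).items():
            combined.append(sum_counts(word, counts))
    # Reduce phase: aggregate the partial sums across all splits.
    return dict(sum_counts(w, c) for w, c in group_by_key(combined).items())


result = word_freq([["The cat sat", "the mat"], ["The end"]])
print(result)
```

The combiner and the reducer share one function here because summing is associative, which is also why the mrjob example can use the same body for both.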
Testing that it works with Hadoop
- Reference: https://github.com/kiwenlau/hadoop-cluster-docker
- Install Hadoop
- Install mrjob
- Run the test script
Result: the job ran successfully. Note that Hadoop must be run locally.
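The checks above can be partly automated. The sketch below only verifies that a `hadoop` executable is on the PATH and that the `mrjob` package is importable, then prints the command one would run. The script name `mr_word_freq_count.py` and the input file `input.txt` are hypothetical placeholders; `-r hadoop` is mrjob's option for selecting the Hadoop runner.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    # True/False for each prerequisite; nothing is installed or started here.
    return {
        "hadoop": shutil.which("hadoop") is not None,
        "mrjob": importlib.util.find_spec("mrjob") is not None,
    }


def build_hadoop_command(script, input_path):
    # Command line that submits an mrjob script through the Hadoop runner.
    return [sys.executable, script, "-r", "hadoop", input_path]


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
    # Hypothetical file names, shown for illustration only.
    print(" ".join(build_hadoop_command("mr_word_freq_count.py", "input.txt")))
```

If both checks pass, running the printed command on the machine where Hadoop is installed is the "test script" step from the list.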