Data processing with mrjob

Example

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

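    def mapper(self, _, line):
        # Emit (word, 1) for every word on the line; the input key is unused.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        # Pre-sum counts on the mapper side to cut down the data that is shuffled.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Sum the partial counts for a word to get its total frequency.
        yield (word, sum(counts))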


if __name__ == '__main__':
    MRWordFreqCount.run()

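Summary: the whole job is a process of handling (key, value) pairs. The mapper emits (word, 1) for each word, and the combiner and reducer sum the values that share the same key.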
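
To make the (key, value) flow concrete, here is a minimal sketch in plain Python that imitates the map, group-by-key, and reduce steps without mrjob or Hadoop. The helper names (map_line, shuffle, reduce_counts, word_freq) are made up for illustration and are not part of mrjob; the real framework does the grouping itself and spreads the work across processes.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    """Map step: emit a (word, 1) pair for each word in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, imitating what happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_freq(lines):
    """Run the three steps in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_freq(["the quick brown fox", "The lazy dog's bone"]))
```

A combiner is left out here because it performs the same summation as the reducer, only earlier and on partial data.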

Testing whether it works on Hadoop

Reference: https://github.com/kiwenlau/hadoop-cluster-docker
Install Hadoop
Install mrjob
Run the test script
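
The last step can be sketched as follows. mrjob selects where a job runs with its -r option (for example inline, local, or hadoop). The script name mr_word_freq_count.py and the input file input.txt are placeholders, not names from these notes, and the sketch only builds and prints the command rather than executing it.

```python
import sys


def build_command(script, input_path, runner="hadoop"):
    """Build the argv list for running an mrjob script with the given runner."""
    # "-r" is mrjob's runner switch; "hadoop" submits the job to a Hadoop cluster.
    return [sys.executable, script, "-r", runner, input_path]


if __name__ == "__main__":
    # Placeholder file names; substitute the real script and input paths.
    command = build_command("mr_word_freq_count.py", "input.txt")
    print(" ".join(command))
```

Passing the resulting list to subprocess.run would launch the job. Running first with runner="inline" is a quick way to check the script before involving Hadoop.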

Result: the job ran successfully. Note that Hadoop must be running locally.