使用mrjob进行数据处理

使用mrjob进行数据处理

样例


"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
     MRWordFreqCount.run()


总结,处理k,v的过程

使用hadoop测试是否可用

  1. 参考https://github.com/kiwenlau/hadoop-cluster-docker
  2. 安装hadoop
  3. 安装mrjob
  4. 跑测试脚本

结果成功 注意要hadoop本地运行

Loading Disqus comments...
Table of Contents