Parallelizing Text Processing

From SemanticLab

The goal of this work is to parallelize important text processing tasks, such as text preprocessing and co-occurrence analysis, using the Hadoop MapReduce framework.

Tasks

  • familiarize yourself with the Hadoop MapReduce framework
  • create a Hadoop hello-world application
  • transfer the text cleanup and preprocessing components to MapReduce
  • transfer the co-occurrence components to MapReduce
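
To make the last two tasks more concrete, the following is a minimal sketch of how text preprocessing and co-occurrence counting could be split into a map phase and a reduce phase. It is a plain in-memory Python simulation, not Hadoop code and not the project's implementation: the function names (`preprocess`, `map_cooccurrences`, `shuffle`, `reduce_counts`, `cooccurrence_counts`) are hypothetical, and the sketch assumes that two terms co-occur when they appear on the same input line and that lowercasing plus keeping alphabetic tokens is sufficient cleanup.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(line):
    """Assumed cleanup step: lowercase the line and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", line.lower())


def map_cooccurrences(line):
    """Map phase: emit ((term_a, term_b), 1) for every unordered pair of
    distinct terms on one line. Sorting makes the pair key order-independent."""
    terms = sorted(set(preprocess(line)))
    for pair in combinations(terms, 2):
        yield pair, 1


def shuffle(pairs):
    """Stand-in for the framework's shuffle step: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the partial counts collected for one term pair."""
    return key, sum(values)


def cooccurrence_counts(lines):
    """Run map, shuffle and reduce sequentially over an iterable of lines."""
    mapped = (kv for line in lines for kv in map_cooccurrences(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Calling `cooccurrence_counts(["Hadoop runs map tasks", "map and reduce tasks"])` returns a dictionary keyed by term pairs such as `("map", "tasks")`. Because each line is mapped independently and each pair is reduced independently, both phases are candidates for parallel execution across nodes.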


Table of Contents

  • Introduction
  • Theoretical Background
    • MapReduce
    • Natural Language Detection
      • Text Preprocessing
      • Co-occurrence analysis
  • Method
  • Implementation
  • Evaluation
  • Outlook and Conclusions