Text Analytics (ITA)

This is a new MSc-level class focusing on the methods, techniques, and tools for text analytics from a Data Science perspective. We start with the phases of text analytics projects, continue with text processing tasks and the various approaches to text representation, and conclude with text analytics models, both traditional ones and those based on advanced techniques such as neural networks.


This course is programming intensive! In addition to a final group project, there will be assignments. Some of them focus on conceptual (theoretical) aspects of text analytics models, but most concentrate on mapping these concepts into programs and text analytics pipelines, using real-world text corpora. For the projects, we will exclusively use Python and frameworks such as NLTK, spaCy, gensim, scikit-learn, and Keras. If you are uncomfortable with Python and programming in general, this class is likely not the best fit for you.


Below you will find more information about this class, based on the module handbook (the course is not yet included there).


Time and Location:

  • Lectures: Mondays 2-4pm and Thursdays 9-11am, online using WebEx (link will be posted via Müsli)
  • Exercise Sessions: Thursdays 2-4pm, online using WebEx

First Lecture: Monday, November 2nd (tentative)


Applicable to courses of study: BSc Applied Computer Science, MSc Applied Computer Science, MSc Scientific Computing


Class Website: in Moodle; the link and key will be provided through the Müsli mailing list prior to the first lecture


Course Objectives: Students

  • can implement and apply different text analytics methods using open source NLP and machine learning frameworks
  • can describe different document and text representation models and can compute and analyze characteristic parameters of these models
  • know how to determine, apply, and interpret use-case specific document similarity measures and underlying ranking concepts
  • know the concepts and techniques underlying different text classification and clustering approaches
  • know different models for phrase extraction and text summarization and are able to apply respective models and concepts using NLP and machine learning frameworks
  • know the fundamental methods for the extraction of document outlines at different levels of granularity
  • are familiar with basic concepts of topic models and their application in different text analytics tasks
  • understand the principles of evaluating results of text analytics tasks
  • know the theoretical background of machine learning methods in sufficient depth to be able to choose parameters and adapt an algorithm to a given text analytics problem
  • are aware of ethical issues arising from applying text analytics in different domains

Course Content:

  • Text analytics in the context of Data Science
  • Open source text analytics, NLP, and machine learning frameworks
  • Fundamentals of NLP pipeline components
  • Text analytics project management
  • Document and text representation models
  • Document and text similarity metrics
  • Approaches, techniques and corpora for benchmarking text analytics tasks
  • Traditional and recent text classification and clustering approaches
  • Information extraction and topic detection approaches
  • Fundamentals of keyword and phrase extraction
  • Text summarization techniques
  • Generating document and text outlines
  • Ethical and legal aspects of text analytics methods

Suggested Prerequisites: Solid knowledge of basic calculus, statistics, and linear algebra; good Python programming skills


Assessments: Assignments (40%) and Programming Project (60%). There will be about 4-5 assignments focusing on the material learned in class on a conceptual and formal level. In the group project, teams of 3-4 students develop a prototypical text analytics framework, including design and evaluation. Written project documentation and the code must be submitted at the end of classes, clearly indicating which student is responsible for which part of the project. Both the assignments and the project must be at least satisfactory (4,0) in order to pass the class.

Instructors:

  • Prof. Dr. Michael Gertz
  • Satya Almasian (Research and Teaching Assistant)
  • Dennis Aumiller (Research and Teaching Assistant)