Thursday, 2014-04-24

Temporal Tagging

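Welcome to the temporal tagging page of the DBS group at Heidelberg University. On this page, you can find information on HeidelTime, the UIMA HeidelTime kit, our temporally annotated corpora (WikiWarsDE, Time4SMS, and Time4SCI), the corpora preparation and evaluation scripts, and how to reproduce HeidelTime's evaluation results.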


HeidelTime is now at Google Code. There, you can find further information on HeidelTime as well as download links.

HeidelTime is a multilingual, cross-domain temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the markup language TimeML (with a focus on the "value" attribute). HeidelTime uses different normalization strategies depending on the domain of the documents to be processed: news, narratives (e.g., Wikipedia articles), colloquial (e.g., SMS, tweets), and scientific (e.g., biomedical studies). It is a rule-based system whose source code and resources (patterns, normalization information, and rules) are strictly separated, so resources for additional languages can be developed using HeidelTime's well-defined rule syntax without changing the source code.
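
As a rough illustration of these two ideas (TIMEX3 output with a normalized "value" attribute, and resources kept apart from the code), here is a minimal Python sketch. It is not HeidelTime and does not use HeidelTime's rule syntax; the names NORM_MONTH, RULES, and tag, as well as the two toy rules, are assumptions made purely for this example.

```python
import re

# Illustrative resources (assumed, NOT HeidelTime's actual files or rule syntax):
# normalization tables and rules are plain data, separate from the code below.
NORM_MONTH = {
    "january": "01", "february": "02", "march": "03", "april": "04",
    "may": "05", "june": "06", "july": "07", "august": "08",
    "september": "09", "october": "10", "november": "11", "december": "12",
}
MONTH_PATTERN = "|".join(NORM_MONTH)

RULES = [
    # Each rule: name, extraction pattern, TIMEX3 type, value template.
    {
        "name": "date_month_day_year",
        "pattern": rf"\b(?P<month>{MONTH_PATTERN}) (?P<day>\d{{1,2}}), (?P<year>\d{{4}})\b",
        "type": "DATE",
        "value": "{year}-{month}-{day}",
    },
    {
        "name": "date_month_year",
        "pattern": rf"\b(?P<month>{MONTH_PATTERN}) (?P<year>\d{{4}})\b",
        "type": "DATE",
        "value": "{year}-{month}",
    },
]


def tag(text, rules=RULES):
    """Wrap matched expressions in TIMEX3 tags carrying a normalized value."""
    spans = []
    for rule in rules:
        for m in re.finditer(rule["pattern"], text, flags=re.IGNORECASE):
            # Skip matches that overlap a span found by an earlier rule.
            if any(m.start() < end and start < m.end() for start, end, _ in spans):
                continue
            parts = m.groupdict()
            parts["month"] = NORM_MONTH[parts["month"].lower()]
            if "day" in parts:
                parts["day"] = parts["day"].zfill(2)
            value = rule["value"].format(**parts)
            spans.append((m.start(), m.end(), (rule["type"], value)))

    # Rebuild the text with inline TIMEX3 annotations, numbering them t1, t2, ...
    out, last = [], 0
    for i, (start, end, (ttype, value)) in enumerate(sorted(spans), start=1):
        out.append(text[last:start])
        out.append(
            f'<TIMEX3 tid="t{i}" type="{ttype}" value="{value}">'
            f"{text[start:end]}</TIMEX3>"
        )
        last = end
    out.append(text[last:])
    return "".join(out)


print(tag("The treaty was signed on April 24, 2014 and ratified in May 2014."))
```

In this sketch, supporting another language would only mean supplying different NORM_MONTH and RULES data; the tag function itself would stay unchanged.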

HeidelTime is one component of our UIMA HeidelTime kit (see the section "UIMA HeidelTime kit" below). Furthermore, we implemented a UIMA-independent standalone version. Both versions are available on our download page.

We developed English (English, English-colloquial, and English-scientific) and German resources for HeidelTime. In addition, we included resources for Dutch, which were kindly provided by Matje van de Camp (Tilburg University). Because the UIMA pipeline does not yet offer preprocessing for Dutch, these resources can only be used with the standalone version so far, but we are currently adapting the preprocessing components of the UIMA HeidelTime kit.

HeidelTime was the best system for the extraction and normalization of English temporal expressions from documents in the TempEval-2 challenge. Furthermore, it has been evaluated on several additional corpora, as described in our paper "Multilingual and Cross-domain Temporal Tagging". All of HeidelTime's evaluation results are reproducible.

To get an impression of HeidelTime, you can try our HeidelTime online demo.

UIMA HeidelTime kit

In addition to HeidelTime with resources for English, German, and Dutch, the UIMA HeidelTime kit contains collection readers, CAS consumers, and analysis engines to process temporally annotated corpora and reproduce HeidelTime's evaluation results on these corpora. The HeidelTime kit contains:

  • ACE Tern Reader: This Collection Reader reads corpora of the ACE Tern style and annotates the document creation time. Using the corpora preparation script (for details, see below), the following corpora can be processed: ACE Tern 2004 training, ACE Tern 2005 training, TimeBank, WikiWars, and WikiWarsDE.
  • TempEval-2 Reader: This Collection Reader reads the TempEval-2 input data of the training and the evaluation sets and annotates the document creation time as well as token and sentence information.
  • HeidelTime (see above)
  • Annotation Translator: This Analysis Engine translates Sentence, Token, and Part-of-Speech annotations of one type system into HeidelTime's type system.
  • ACE Tern Writer: This CAS Consumer creates output data as needed to run the official ACE Tern evaluation scripts.
  • TempEval-2 Writer: This CAS Consumer creates output data as needed to evaluate the TempEval-2 tasks of extracting and normalizing temporal expressions using the official evaluation scripts.

The UIMA HeidelTime kit, including detailed documentation, can be downloaded here. The documentation is located in doc/readme.txt. The documentation of HeidelTime's rule syntax is preliminary but will be updated soon. Furthermore, HeidelTime and its rule syntax are described in our journal paper "Multilingual and Cross-domain Temporal Tagging".
If you want to receive notifications on updates of HeidelTime, please send me an email or check the notification box in the download form.

Temporally Annotated Corpora


Motivated by the development of WikiWars (Pawel Mazur and Robert Dale, 2010: A New Corpus for Research on Temporal Expressions) and the fact that no temporally annotated corpus for German was publicly available, we developed WikiWarsDE. It contains the German articles corresponding to those of the English corpus, i.e., 22 articles about famous wars in history, annotated with temporal expressions according to the TIMEX2 annotation standard (2005). In comparison to other temporally annotated corpora containing news documents, the documents of WikiWars and WikiWarsDE are much longer and contain many temporal expressions. Thus, their temporal discourse structure is much more complicated and very challenging for temporal taggers. Details on the corpus are described in our paper "WikiWarsDE: A German Corpus of Narratives Annotated with Temporal Expressions".

WikiWarsDE is available on our download page.

Time4SMS & Time4SCI

Temporal tagging on different domains poses different challenges. For example, in colloquial texts (such as SMS or tweets), a temporal tagger has to deal with non-standard language, and in scientific documents (e.g., biomedical publications), temporal expressions often refer to a local time frame instead of "real" points or intervals in time. To study these challenges and to test strategies on different domains, we developed the temporally annotated corpora Time4SMS and Time4SCI. Time4SMS contains 1000 SMS from the NUS SMS corpus, and Time4SCI contains 50 PubMed abstracts about clinical trials. Details on these corpora are described in our paper "Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards".
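
To make the idea of a domain-dependent normalization strategy more concrete, the following Python sketch resolves a month name without a year against a reference time. The choice of reference time shown here (document creation time for news, the most recently mentioned date for narratives) is an assumption for illustration and is not taken from this page; the names MONTHS and resolve_month are likewise hypothetical.

```python
from datetime import date

# Toy lookup table (assumed for this example only).
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}


def resolve_month(month_name, domain, document_creation_time, last_mentioned=None):
    """Return a TIMEX3-style value (YYYY-MM) for a month given without a year.

    Assumed strategy: narrative text is anchored to the most recently
    mentioned date if there is one; all other domains fall back to the
    document creation time.
    """
    month = MONTHS[month_name.lower()]
    if domain == "narratives" and last_mentioned is not None:
        reference = last_mentioned
    else:
        reference = document_creation_time
    return f"{reference.year:04d}-{month:02d}"


dct = date(2014, 4, 24)
print(resolve_month("March", "news", dct))                            # 2014-03
print(resolve_month("March", "narratives", dct, date(1916, 7, 1)))    # 1916-03
```

The same expression thus receives different values depending on the domain, which is why a single fixed strategy would not work well across news articles and long narrative documents.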

Time4SMS and Time4SCI are available on our download page.  

Corpora Preparation and Evaluation Scripts

Here, we provide scripts to transform publicly available temporally annotated corpora into the ACE Tern style format or the TempEval-2 style format. The transformed corpora can then be read directly by our ACE Tern Reader or TempEval-2 Reader, respectively. Furthermore, we provide scripts to automate the evaluation process. All scripts needed for corpora preparation and evaluation are available on our download page. Details on how to use these scripts are described in the next section.

Reproducing HeidelTime's Evaluation Results

As mentioned above, HeidelTime has been evaluated on several corpora, as described in our papers "Multilingual and Cross-domain Temporal Tagging" and "Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards". Details on the evaluation results for the most recent stable version of HeidelTime, as well as a detailed description of how to reproduce the results, are provided on the HeidelTime Google Code project site.


