Data Sets

List of data sets available for download.

Online News Citation Network

The Online News Citation Network is a sparse citation network between news articles (724,285 edges, 525,797 nodes). It contains citations between online news articles of major German and British news outlets, starting in June 2014. Only links that are embedded in the body text of the articles were extracted for the network. The data set contains no content of the original articles, but URIs to each individual article are provided. Note that news outlets were added incrementally rather than all at once, so the amount of data varies between outlets. Most notably, British news outlets were only added starting in July 2015. The data set consists of three files:

newscit_nodes.txt

Format: <article id> <publication date> <news outlet id> <feed type> <article URI>

  • <article id>: article id as an integer value. Note that article ids are not consecutive and not necessarily time ordered.
  • <publication date>: publication date of the article in ISO 8601 format (UTC timezone). In cases where an article is missing a publication date, the date of retrieval is used as substitute.
  • <news outlet id>: name of the newspaper or news source that published the article
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <article URI>: URI of the article as an HTTP link
newscit_edges.txt

Format: <source article id> <target article id>

  • <source article id>: id of the article that contains a reference link in its text
  • <target article id>: id of the article that is being referenced
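As a sketch of how the node and edge files fit together, the following Python snippet loads both into an adjacency list. The file paths and the whitespace separator are assumptions and should be checked against the downloaded archive.

```python
from collections import defaultdict

def load_citation_graph(nodes_path, edges_path):
    """Load the news citation network into dicts.

    Assumes whitespace-separated fields as listed above; since the
    article URI contains no spaces, a plain split() suffices.
    """
    articles = {}
    with open(nodes_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 5:
                continue  # skip malformed lines
            article_id, pub_date, outlet, feed_type, uri = fields[:5]
            articles[int(article_id)] = {
                "date": pub_date, "outlet": outlet,
                "feed": feed_type, "uri": uri,
            }

    out_links = defaultdict(list)  # source article -> cited articles
    with open(edges_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            src, dst = map(int, parts)
            out_links[src].append(dst)
    return articles, out_links
```

Since article ids are not consecutive, dicts keyed by id are a safer choice here than dense arrays.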
newscit_feeds.txt

Format: <news outlet> <feed type> <RSS feed URI> <language> <start date> <article count>

  • <news outlet>: name of the newspaper or news source
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <RSS feed URI>: URI of the RSS feed
  • <language>: language of the news outlet (English or German)
  • <start date>: date of the first addition of the feed to the data set
  • <article count>: number of articles from this feed in the data set

Download data as zip (24.4 MB compressed, 92 MB uncompressed, data version of November, 2018).
[previous versions: 07.2015, 08.2015, 11.2016, 06.2017]

Further details on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

  • Andreas Spitz and Michael Gertz.
    Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles.
    In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '15), Paris, France, August 25-28, 2015, 274–279
    [pdf] [acm] [bibtex] [slides]



Temporal News Citation Network

This version of the online news citation network was annotated for temporal expressions within the article body to investigate the prediction of article publication dates as a regression task, purely by using the network topology of the news citation network. The data set covers English news articles between November 1, 2015 and October 31, 2017. Temporal expressions in the article text are annotated with the HeidelTime temporal tagger. URIs to the original article contents are included.

Data

The data set consists of six files:


data_newscit_nodes.txt
Format: <article id> <publication date> <news outlet id> <feed type> <article URI>

  • <article id>: article id as an integer value. Note that article ids are not consecutive and not necessarily time ordered.
  • <publication date>: publication date of the article in ISO 8601 format (UTC timezone). In cases where an article is missing a publication date, the date of retrieval is used as substitute.
  • <news outlet id>: name of the newspaper or news source that published the article
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <article URI>: URI of the article as an HTTP link

data_newscit_edges.txt
Format: <source article id> <target article id>

  • <source article id>: id of the article that contains a reference link in its text
  • <target article id>: id of the article that is being referenced

data_newscit_feeds.txt
Format: <news outlet> <feed type> <RSS feed URI> <language> <start date> <article count>

  • <news outlet>: name of the newspaper or news source
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <RSS feed URI>: URI of the RSS feed
  • <language>: language of the news outlet (English or German)
  • <start date>: date of the first addition of the feed to the data set
  • <article count>: number of articles from this feed in the data set

data_newscit_outlets.txt
Format: <outlet abbreviation> <outlet id> <start date> <outlet name>

  • <outlet abbreviation>: 2-3 character abbreviation of the news outlet
  • <outlet id>: internally used id string of the news outlet in the data
  • <start date>: first date for an article in the data set (first crawling date)
  • <outlet name>: full name of the news outlet

data_newscit_tempExp.txt
Format: <article id> <sentence id> <temporal expression>

  • <article id>: integer id of the article containing this expression
  • <sentence id>: integer id of the sentence containing this expression (n-th sentence in the document)
  • <temporal expression>: normalized value YYYY-MM-DD of the temporal expression. Only dates are considered here.

data_newscit_features.txt
Full set of all features used for prediction, as produced by the script file r02_feature_generation.r


Download the data set as zip (30 MB compressed, 131 MB uncompressed)

Code

This archive contains all code that was used in predicting and evaluating document creation times within the news citation network. The contained script files can be executed in the R software environment by placing the scripts in the same folder as the network data provided above.


Download the code as zip


If you use the data or the code in your research, please cite the following article as the source:

  • Andreas Spitz, Jannik Strötgen, and Michael Gertz.
    Predicting Document Creation Times in News Citation Networks.
    In: Proceedings of the 27th International Conference on World Wide Web (WWW '18) Companion, Lyon, France, April 23-27, 2018
    [pdf] [acm] [bibtex] [slides]



U.S. Election Days

The data set contains the dates of all U.S. Election Days, which occur annually on the first Tuesday after the first Monday in November (i.e., it varies between November 2 and November 8). Every four years (in years divisible by 4), the Election Day is a presidential Election Day. The data is formatted as a tab-separated txt file with two columns.

Format: <date> \t <type>

  • <date>: date of the Election Day as YYYY-MM-DD
  • <type>: regular or presidential
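The rule stated above ("the first Tuesday after the first Monday in November") can be computed directly, which is handy for validating the file; a minimal Python sketch:

```python
from datetime import date, timedelta

def election_day(year):
    """Return the U.S. Election Day of a given year: the first
    Tuesday after the first Monday in November."""
    nov_first = date(year, 11, 1)
    # days until the first Monday of November (Monday == weekday 0)
    offset = (0 - nov_first.weekday()) % 7
    first_monday = nov_first + timedelta(days=offset)
    return first_monday + timedelta(days=1)

def election_type(year):
    """Classify the year as in the data set's <type> column."""
    return "presidential" if year % 4 == 0 else "regular"
```

For example, `election_day(2016)` yields November 8, 2016, a presidential Election Day, while `election_day(2021)` yields November 2, 2021, a regular one.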

Download data as zip

The data was originally used as evaluation set for similarity prediction between dates. If you use the data for research purposes, please cite this as the source:

  • Andreas Spitz, Jannik Strötgen, Thomas Bögel, and Michael Gertz.
    Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model.
    In: Proceedings of the 24th International Conference on World Wide Web (WWW '15) Companion, Florence, Italy, May 18-22, 2015, 1375–1380
    [pdf] [acm] [bibtex] [slides]



LOAD Networks

LOAD networks are large-scale graphs that are constructed from the implicit relations between named entities that co-occur in the text of document collections. The name derives from the focus on locations, organizations, actors (persons) and dates as named entities. Such networks are used, for example, to explore entity and event mentions in the online interface EVELIN. For a collection of LOAD network data sets, see the LOAD web page.



Wikipedia Glossary Entries

The data set contains glossary entries extracted from Wikipedia for terms from the fields of astronomy, biology, chemistry and geology. The data is formatted as a tab-separated txt file with three columns:

Format: <wikidata ID> \t <entity label> \t <glossary description>

  • <wikidata ID>: Wikidata identifier of the entity
  • <entity label>: English Wikidata label of the entity
  • <glossary description>: Wikipedia description of the entity in the glossary
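A hedged sketch of reading the tab-separated file with Python's csv module (the file name passed in is an assumption; use whatever name the archive contains):

```python
import csv

def read_glossary(path):
    """Read the glossary file into a dict keyed by Wikidata ID.

    Assumes the three tab-separated columns described above:
    <wikidata ID> \t <entity label> \t <glossary description>.
    """
    entries = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 3:
                continue  # skip malformed lines
            wikidata_id, label, description = row
            entries[wikidata_id] = (label, description)
    return entries
```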

Download data as zip

The data was originally used as evaluation set for the extraction of descriptive sentences for entities and entity relations. If you use the data set for research, please cite this as the source:

  • Andreas Spitz, Gloria Feher, and Michael Gertz.
    Extracting Descriptions of Location Relations from Implicit Textual Networks.
    In: Proceedings of the 11th Workshop on Geographic Information Retrieval (GIR '17), Heidelberg, Germany, November 30 - December 1, 2017, 1:1–1:9
    [pdf] [acm] [bibtex] [slides]



Wikipedia Social Network

The Wikipedia social network is a weighted, large-scale network of roughly 800k nodes and 70M edges that is constructed from person mentions in the English Wikipedia corpus. The network is built from co-occurrences of person mentions within the entire Wikipedia corpus. Edges are weighted based on the distances between person mentions in the text and then aggregated over all co-occurrences to collapse parallel edges and create a simple graph. The network is designed as a stand-in for large-scale real-world social networks that are otherwise unavailable at this time. Additionally, the network is enriched with community information (e.g. Wikipedia category membership of persons) and additional node attributes (e.g. Wikidata information about professions or the date of birth / death). The network has been shown to possess characteristic structural features of traditional small-scale social networks, while its nodes correspond to contemporary and historic figures, in contrast to currently used, anonymized large-scale social media data sets. The data set consists of four files:

wsn_edgelist-dicos.txt

Contains all edges of the Wikipedia Social Network. Note:
This is the only file that uses whitespace as the separator instead of a tab.
Format: <source person id> <target person id> <edge weight>

wsn_category-person.txt

Contains information of category membership for persons in the network.
Format: <category id> \t <category name> [\t <person id>]*

wsn_person-name-gender-birth-death.txt

Contains additional information for persons in the network.
Format: <person id> \t <full name> \t <gender> \t <year of birth> \t <year of death>

wsn_person-occupation.txt

Occupation information for persons in the network. Note:
If a person has multiple occupations, they appear in multiple rows, one per occupation.
Format: <person id> \t <occupation>
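Since the edge list is whitespace-separated while the attribute files are tab-separated, a reader has to treat them differently. A minimal Python sketch for two of the files (edge weights are assumed to be numeric):

```python
from collections import defaultdict

def read_wsn_edges(path):
    """Read wsn_edgelist-dicos.txt, which (unlike the other files)
    uses whitespace as the separator."""
    edges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, dst, weight = line.split()
            edges.append((int(src), int(dst), float(weight)))
    return edges

def read_occupations(path):
    """Read wsn_person-occupation.txt; persons with several
    occupations appear on multiple rows, so collect them per id."""
    occupations = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            person_id, occupation = line.rstrip("\n").split("\t")
            occupations[int(person_id)].append(occupation)
    return occupations
```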

Download data as tar.gz (963 MB compressed, 2.7 GB uncompressed)

Detailed information on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

  • Johanna Geiß, Andreas Spitz, and Michael Gertz.
    Beyond Friendships and Followers: The Wikipedia Social Network.
    In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '15), Paris, France, August 25-28, 2015, 472–479
    [pdf] [acm] [bibtex]



Wikipedia Location Network

The Wikipedia location network is a weighted, undirected large-scale network of roughly 720k nodes and 178M edges that is constructed from toponyms in the English Wikipedia corpus. The network is built from co-occurrences of location mentions within the entire Wikipedia corpus. Edges are weighted based on the distances between toponyms in the text and then aggregated over all co-occurrences to collapse parallel edges and create a simple graph. The network is enriched with hierarchical information at the city, country, and continent granularities, as well as additional node attributes (e.g. Wikidata information about places). The data set consists of three files:

wln_edgelist-dicos.txt

Contains all edges of the Wikipedia Location Network. Note: This is the only file that uses whitespace as the separator instead of a tab.
Format: <source location id> <target location id> <edge weight>

wln_label-type.txt

Contains additional information for locations in the network. Note that location IDs correspond to Wikidata IDs and can be converted by prefixing them with Q. The location label is the English name of the entity in Wikidata. Location types are "POI", "city", "continent", "country", "mountain", "mountain range", "river", and "NIL" if no information is available.
Format: <location id> \t <location label> \t <location type>

wln_hierarchy.txt

Contains hierarchical information for locations in the network. The hierarchies given are "continent" and "country". Since locations can belong to multiple countries or continents according to Wikidata, the entries are lists of location IDs. Multiple entries in a list are separated by commas and lists are surrounded by []. For some locations (such as countries), only the continent is known. In these cases, the entry has only 2 columns instead of 3.
Format: <location id> \t [<continent id>*] \t [<country id>*]
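Parsing the bracketed lists and the optional third column can be sketched as follows (a minimal Python example; the exact spelling of empty lists in the file is an assumption):

```python
def parse_id_list(field):
    """Parse a bracketed, comma-separated list such as "[1,2,3]"."""
    inner = field.strip().strip("[]")
    return [int(x) for x in inner.split(",") if x.strip()]

def read_hierarchy(path):
    """Read wln_hierarchy.txt. Rows with only two columns carry a
    continent list but no country list."""
    hierarchy = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            loc_id = int(cols[0])
            continents = parse_id_list(cols[1])
            countries = parse_id_list(cols[2]) if len(cols) > 2 else []
            hierarchy[loc_id] = (continents, countries)
    return hierarchy
```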

Download data as tar.gz (1.4 GB compressed, 6.2 GB uncompressed)

More information on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

  • Johanna Geiß, Andreas Spitz, Jannik Strötgen, and Michael Gertz.
    The Wikipedia Location Network: Overcoming Borders and Oceans.
    In: Proceedings of the 9th Workshop on Geographic Information Retrieval (GIR '15), Paris, France, November 26-27, 2015, 2:1–2:3
    [pdf] [acm] [bibtex] [slides]



Wikipedia Event Triple Network

This is a collection of event triples that were extracted from the English Wikipedia dump of May 2015. Entities are of type actor, location and date. Actors and locations were annotated by matching Wikipedia links to Wikidata items. Dates were annotated with HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences. The data set consists of two files:

EventTriplesWS2_nodes.csv (1,223,415 lines, 43 MB)
Format: <node id>, <node type>, <node label>, <wikidata label>, <degree>

  • <node id>: internal node ID used in the Wikidata LOAD network
  • <node type>: type of the node (actor, location, date)
  • <node label>: Wikidata ID (excluding prefix Q)
  • <wikidata label>: English label of the node in Wikidata
  • <degree>: degree of the node in the event network


EventTriplesWS2_triples.csv (112,986,092 lines, 5.1 GB)
Format: <locationID>, <actorID>, <dateID>, <averageSimilarity>, <minSimilarity>, <occurrenceCount>

  • <locationID>: node id of the location component of the triple
  • <actorID>: node id of the actor component of the triple
  • <dateID>: node id of the date component of the triple
  • <averageSimilarity>: sum, over all occurrences of the triple, of the average similarity of its three edges
  • <minSimilarity>: sum, over all occurrences of the triple, of the minimum similarity of its three edges
  • <occurrenceCount>: total number of times this triple appears in Wikipedia inside the window size
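At 5.1 GB, the triples file is best processed as a stream. The following sketch filters triples by occurrence count without loading the file into memory (plain comma-plus-space separation is assumed from the format line above):

```python
import csv

def frequent_triples(path, min_count=10):
    """Yield (location, actor, date, averageSimilarity) tuples for
    triples whose occurrence count meets the threshold, reading the
    file line by line rather than all at once."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if len(row) != 6:
                continue  # skip malformed lines
            loc, actor, date_id, avg_sim, min_sim, count = row
            if int(count) >= min_count:
                yield int(loc), int(actor), int(date_id), float(avg_sim)
```

Because the function is a generator, downstream code can stop early or aggregate on the fly without ever holding all 112 million rows at once.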

Download data as zip (468 MB)

Further details on the network's creation can be found in the original article.
If you use the data for research purposes, please cite this as the source:

  • Andreas Spitz, Johanna Geiß, Michael Gertz, Stefan Hagedorn, and Kai-Uwe Sattler.
    Refining Imprecise Spatio-temporal Events: A Network-based Approach.
    In: Proceedings of the 10th Workshop on Geographic Information Retrieval (GIR '16), Burlingame, California, USA, October 31, 2016, 5:1–5:10
    [pdf] [acm] [bibtex] [slides]