
Data Sets

List of data sets available for download.

Online News Citation Network

The Online News Citation Network is a sparse citation network between news articles (297,192 edges, 209,207 nodes). It contains citations between online news articles of major German and British news outlets, starting in June 2014. To create the network, only those links between articles were extracted that are embedded in the articles' texts. The data set contains no content of the original articles, but a URI to each individual article is provided. Note that news outlets were added incrementally rather than all at once, so more data exists for some outlets than for others. Most notably, British news outlets were only added starting in July 2015. The data set consists of three files:

newscit_nodes.txt

Format: <article id> <publication date> <news outlet id> <feed type> <article URI>

  • <article id>: article id as an integer value. Note that article ids are not consecutive and not necessarily time ordered.
  • <publication date>: publication date of the article in ISO 8601 format (UTC timezone). In cases where an article is missing a publication date, the date of retrieval is used as substitute.
  • <news outlet id>: name of the newspaper or news source that published the article
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <article URI>: URI of the article as an HTTP link

newscit_edges.txt

Format: <source article id> <target article id>

  • <source article id>: id of the article that contains a reference link in its text
  • <target article id>: id of the article that is being referenced

newscit_feeds.txt

Format: <news outlet> <feed type> <RSS feed URI> <language> <start date> <article count>

  • <news outlet>: name of the newspaper or news source
  • <feed type>: type of feed that the article was published in (e.g. politics)
  • <RSS feed URI>: URI of the RSS feed
  • <language>: language of the news outlet (English or German)
  • <start date>: date of the first addition of the feed to the data set
  • <article count>: number of articles from this feed in the data set
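
As a rough sketch of how the node and edge files might be read, the following Python snippet parses a few lines in the formats given above. It assumes that fields are separated by whitespace and that no field contains whitespace itself; neither is stated explicitly here, so the actual files should be checked first. The sample lines, outlet names and URIs are invented for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def parse_node(line):
    """Split one line of newscit_nodes.txt into its five fields.

    Assumption: fields are whitespace-separated and contain no whitespace.
    """
    article_id, pub_date, outlet, feed_type, uri = line.split()
    return {
        "id": int(article_id),
        # Older Python versions reject a trailing "Z", so map it to an offset.
        "published": datetime.fromisoformat(pub_date.replace("Z", "+00:00")),
        "outlet": outlet,
        "feed": feed_type,
        "uri": uri,
    }

def build_citations(edge_lines):
    """Map each source article id to the set of article ids it cites."""
    cites = defaultdict(set)
    for line in edge_lines:
        source, target = map(int, line.split())
        cites[source].add(target)
    return cites

# Invented sample data, not taken from the data set.
sample_nodes = [
    "17 2014-06-03T08:15:00 outlet_a politics http://example.org/a/17",
    "42 2014-06-05T12:00:00 outlet_b politics http://example.org/b/42",
]
sample_edges = ["42 17"]

nodes = {n["id"]: n for n in map(parse_node, sample_nodes)}
cites = build_citations(sample_edges)
print(nodes[42]["outlet"], "cites", [nodes[t]["outlet"] for t in cites[42]])
```

Because article ids are neither consecutive nor time ordered, the sketch keys nodes by id in a dictionary instead of relying on list positions.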

Download data as zip (9.8 MB compressed, 36 MB uncompressed; data version of November 25, 2016).
[previous versions: 07.2015, 08.2015]

Further details on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

A. Spitz and M. Gertz.
Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles.
Advances in Social Networks Analysis and Mining (ASONAM), 2015

 


U.S. Election Days

The data set contains the dates of all U.S. Election Days, which occur annually on the first Tuesday after the first Monday in November (i.e., on a date between November 2 and November 8). Every four years (in years divisible by 4), the Election Day is a presidential Election Day. The data is formatted as a tab-separated txt file with two columns.

Format: <date> \t <type>

  • <date>: date of the Election Day as YYYY-MM-DD
  • <type>: regular or presidential
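
The date rule described above can be sketched in a few lines of Python. This is only an illustration of the rule and of the two-column row format; the function names are hypothetical, and the range of years covered by the actual file is not specified here.

```python
from datetime import date, timedelta

def election_day(year):
    """Return the first Tuesday after the first Monday in November."""
    nov1 = date(year, 11, 1)
    # weekday(): Monday is 0, so this is the offset to the first Monday.
    first_monday = nov1 + timedelta(days=(7 - nov1.weekday()) % 7)
    return first_monday + timedelta(days=1)

def election_type(year):
    """Label years divisible by 4 as presidential, all others as regular."""
    return "presidential" if year % 4 == 0 else "regular"

def format_row(year):
    """Build one tab-separated row consisting of the date and the type."""
    return f"{election_day(year).isoformat()}\t{election_type(year)}"

for year in (2014, 2015, 2016):
    print(format_row(year))
```

Since the first Monday of November falls between the 1st and the 7th, the resulting Tuesday always lies between November 2 and November 8.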

Download data as zip

The data was originally used as an evaluation set for similarity prediction between dates. If you use the data for research purposes, please cite this as the source:

A. Spitz, J. Strötgen, T. Bögel, and M. Gertz.
Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model.
5th Temporal Web Analytics Workshop, Proceedings of the 24th International Conference on World Wide Web Companion (WWW), pages 1375-1380, 2015

 


Wikipedia Social Network

The Wikipedia social network is a weighted, large-scale network of roughly 800k nodes and 70M edges that is constructed from person mentions in the English Wikipedia corpus. Network construction is based on co-occurrences of person mentions within the entire Wikipedia corpus. Edges are weighted based on the distances of person mentions within the text and are then aggregated over all co-occurrences of mentions to collapse parallel edges and create a simple graph. The network is designed as a stand-in for large-scale real-world social networks that are otherwise unavailable at this time. Additionally, the network is enriched with community information (e.g. Wikipedia category membership of persons) and additional node attributes (e.g. Wikidata information about professions or the date of birth / death). The network has been shown to possess characteristic structural features of traditional small-scale social networks, while its nodes correspond to contemporary and historic figures, as opposed to the anonymized large-scale social media data sets currently in use. The data set consists of four files:

wsn_edgelist-dicos.txt

Contains all edges of the Wikipedia Social Network. Note: This is the only file that uses whitespace as separator instead of a tab.
Format: <source person id> <target person id> <edge weight>
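
Given the size of the edge list, reading it line by line is likely preferable to loading it into memory at once. The following Python sketch streams edges and sums the incident edge weights per node. It assumes that person ids are integers and weights are decimal numbers, and it counts each edge for both endpoints, i.e., it treats the graph as undirected; these are assumptions, not statements about the file. The sample lines are invented.

```python
import io

def iter_edges(lines):
    """Yield (source id, target id, weight) from whitespace-separated lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        yield int(parts[0]), int(parts[1]), float(parts[2])

def weighted_degree(lines):
    """Sum the weights of all edges incident to each node in a single pass."""
    totals = {}
    for source, target, weight in iter_edges(lines):
        totals[source] = totals.get(source, 0.0) + weight
        totals[target] = totals.get(target, 0.0) + weight
    return totals

# Invented sample; for the real data, pass an open file handle instead.
sample = io.StringIO("1 2 0.5\n2 3 1.25\n1 3 0.25\n")
strength = weighted_degree(sample)
print(strength)
```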

wsn_category-person.txt

Contains information of category membership for persons in the network.
Format: <category id> \t <category name> [\t <person id>]*
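
Because each row carries a variable number of person ids, the file cannot be read as a fixed-width table. A minimal Python sketch, assuming that person ids are integers and keeping category ids as plain strings, might look as follows. The sample rows and category names are invented, and the helper names are hypothetical.

```python
def parse_category_line(line):
    """Split a row into category id, category name and a list of person ids."""
    fields = line.rstrip("\n").split("\t")
    category_id, name = fields[0], fields[1]
    # Everything after the second column is a person id; there may be none.
    members = [int(pid) for pid in fields[2:] if pid]
    return category_id, name, members

def invert(categories):
    """Map each person id to the names of the categories it belongs to."""
    by_person = {}
    for _, name, members in categories:
        for pid in members:
            by_person.setdefault(pid, []).append(name)
    return by_person

# Invented sample rows, not taken from the data set.
sample = [
    "c1\tPhysicists\t10\t11\n",
    "c2\tNobel laureates\t10\n",
    "c3\tEmpty category\n",
]
parsed = [parse_category_line(row) for row in sample]
by_person = invert(parsed)
print(by_person[10])
```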

wsn_person-name-gender-birth-death.txt

Contains additional information for persons in the network.
Format: <person id> \t <full name> \t <gender> \t <year of birth> \t <year of death>

wsn_person-occupation.txt

Contains occupation information for persons in the network. Note: If a person has multiple occupations, that person appears in multiple rows.
Format: <person id> \t <occupation>

Download data as tar.gz (963 MB compressed, 2.7 GB uncompressed)

Detailed information on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

J. Geiß, A. Spitz and M. Gertz.
Beyond Friendships and Followers: The Wikipedia Social Network.
Advances in Social Networks Analysis and Mining (ASONAM), 2015

 


Wikipedia Location Network

The Wikipedia location network is a weighted, undirected large-scale network of roughly 720k nodes and 178M edges that is constructed from toponyms in the English Wikipedia corpus. Network construction is performed based on co-occurrences of location mentions within the entire Wikipedia corpus and edges are weighted based on the distances of toponyms within the text, before they are aggregated over all co-occurrences of mentions to collapse parallel edges and create a simple graph. The network is enriched with hierarchical information in granularities city, country and continent and additional node attributes (e.g. Wikidata information about places). The data set consists of three files:

wln_edgelist-dicos.txt

Contains all edges of the Wikipedia Location Network. Note: This is the only file that uses whitespace as separator instead of a tab.
Format: <source location id> <target location id> <edge weight>

wln_label-type.txt

Contains additional information for locations in the network. Note that the IDs of locations correspond to Wikidata IDs and can be transformed into them by adding Q in front of the ID. The location label is the English name of the entity in Wikidata. Location types are "POI", "city", "continent", "country", "mountain", "mountain range" and "river", or "NIL" if no information is available.
Format: <location id> \t <location label> \t <location type>
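
The ID conversion and the handling of the "NIL" type can be sketched as follows in Python. The sample line is made up for illustration and is not claimed to appear in the file in this form; the function names are hypothetical.

```python
def to_wikidata_id(location_id):
    """Prefix a location id with Q to obtain the Wikidata item id."""
    return f"Q{location_id}"

def parse_label_line(line):
    """Split a tab-separated row into location id, label and type."""
    location_id, label, loc_type = line.rstrip("\n").split("\t")
    return {
        "wikidata_id": to_wikidata_id(location_id),
        "label": label,
        # "NIL" marks a missing type; represent it as None.
        "type": None if loc_type == "NIL" else loc_type,
    }

# Made-up sample line in the documented format.
sample = "64\tBerlin\tcity\n"
record = parse_label_line(sample)
print(record)
```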

wln_hierarchy.txt

Contains hierarchical information for locations in the network. The hierarchy levels given are "continent" and "country". Since locations can belong to multiple countries or continents according to Wikidata, the entries are lists of location IDs. Multiple entries in a list are separated by commas and lists are surrounded by []. For some locations (such as countries), only the continent is known. In these cases, the entry has only 2 columns instead of 3.
Format: <location id> \t [<continent id>*] \t [<country id>*]
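
A parser for this file has to handle both the bracketed lists and the optional third column. The following Python sketch shows one way to do so, assuming that all IDs are integers and that a missing country column can be represented as an empty list. The sample lines are invented and the function names are hypothetical.

```python
def parse_id_list(field):
    """Turn a bracketed, comma-separated list such as "[46,48]" into ints."""
    inner = field.strip().strip("[]")
    return [int(item) for item in inner.split(",") if item.strip()]

def parse_hierarchy_line(line):
    """Return (location id, continent ids, country ids) for one row."""
    fields = line.rstrip("\n").split("\t")
    location_id = int(fields[0])
    continents = parse_id_list(fields[1])
    # The country column is absent when only the continent is known.
    countries = parse_id_list(fields[2]) if len(fields) > 2 else []
    return location_id, continents, countries

# Invented sample rows: one with three columns, one with two.
rows = ["64\t[46]\t[183]\n", "183\t[46]\n"]
for row in rows:
    print(parse_hierarchy_line(row))
```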

Download data as tar.gz (1.4 GB compressed, 6.2 GB uncompressed)

More information on the network's creation can be found in the original article. If you use the data for research purposes, please cite this as the source:

J. Geiß, A. Spitz, J. Strötgen and M. Gertz.
The Wikipedia Location Network - Overcoming Borders and Oceans.
9th Workshop on Geographic Information Retrieval (GIR), 2015

 


Wikipedia Event Triple Network

This is a collection of event triples that were extracted from the English Wikipedia dump of May 2015. Entities are of type actor, location and date. Actors and locations were annotated by matching Wikipedia links to Wikidata items. Dates were annotated with HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences. The data set consists of two files:

EventTriplesWS2_nodes.csv (1,223,415 lines, 43 MB)
Format: <node id>, <node type>, <node label>, <wikidata label>, <degree>

  • <node id>: internal node ID used in the Wikidata LOAD network
  • <node type>: type of the node (actor, location, date)
  • <node label>: Wikidata ID (excluding prefix Q)
  • <wikidata label>: English label of the node in Wikidata
  • <degree>: degree of the node in the event network


EventTriplesWS2_triples.csv (112,986,092 lines, 5.1 GB)
Format: <locationID>, <actorID>, <dateID>, <averageSimilarity>, <minSimilarity>, <occurrenceCount>

  • <locationID>: node id of the location component of the triple
  • <actorID>: node id of the actor component of the triple
  • <dateID>: node id of the date component of the triple
  • <averageSimilarity>: aggregated sum of the averages of the similarities of all three edges in the triple
  • <minSimilarity>: aggregated sum of the minima of the similarities of all three edges in the triple
  • <occurrenceCount>: total number of times this triple appears in Wikipedia inside the window size
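
With more than 100 million rows, the triples file is best processed as a stream. The Python sketch below filters triples by their occurrence count using the standard csv module. It assumes that the file has no header row, that the three IDs and the count are integers, and that the similarities are decimal numbers; the sample rows, the short field names and the function name are invented for illustration.

```python
import csv
import io

FIELDS = ("location", "actor", "date", "avg_sim", "min_sim", "count")

def iter_triples(handle, min_count=1):
    """Yield triples whose occurrence count is at least min_count."""
    # skipinitialspace tolerates a blank after each comma.
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        location, actor, date = (int(value) for value in row[:3])
        avg_sim, min_sim = float(row[3]), float(row[4])
        count = int(row[5])
        if count >= min_count:
            values = (location, actor, date, avg_sim, min_sim, count)
            yield dict(zip(FIELDS, values))

# Invented sample; for the real data, pass an open file handle instead.
sample = io.StringIO("5, 9, 12, 0.8, 0.4, 3\n5, 7, 12, 0.2, 0.1, 1\n")
frequent = list(iter_triples(sample, min_count=2))
print(frequent)
```

The node ids in each triple can then be looked up in EventTriplesWS2_nodes.csv to obtain the type and Wikidata label of each component.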

Download data as zip (468 MB)

Further details on the network's creation can be found in the original article.
If you use the data for research purposes, please cite this as the source:

A. Spitz, J. Geiß, M. Gertz, S. Hagedorn and K. Sattler
Refining Imprecise Spatio-temporal Events: A Network-based Approach.
10th ACM SIGSPATIAL Workshop on Geographic Information Retrieval (GIR), 2016.

 

 


 

Last modified: 29.11.2016