
TWIPA - TWItter rePository mAnager

This page describes the motivation behind building TWIPA, the services it provides, an overview of the system, and the steps needed to run and use TWIPA.

Motivation

The availability of Twitter content through an easy-to-access public Twitter API has encouraged research communities to continuously crawl tweets for analysis and knowledge discovery. This results in a huge repository of tweets stored in text files. Usually, the tweets published during a single day are stored in a separate file, a coarse repository-level indexing that expedites access to tweets posted during a particular time window. Moreover, with a text stream arriving at a high and continuous rate, such a repository quickly grows to terabytes of content, which is typically kept in compressed text files.

To process and analyze subsets of these tweets, or even the entire repository, users have to follow a tedious preparation procedure. They locate the needed files, uncompress them, retrieve tweets based on spatial, temporal, and/or text queries, and finally transfer them to a database management system such as MongoDB. After that, a number of indexing operations must be performed, depending on the application's needs. This preparation procedure is repeated every time a new dataset is considered for analysis. Hence, a pipeline framework that performs these tasks saves considerable preparation time and avoids human errors. Users only need to run the framework, set a few parameters, and wait until the system has transferred the demanded tweets to MongoDB and indexed them.

TWIPA is a Java GUI-based tool that accesses a Twitter repository, retrieves tweets based on spatio-temporal and/or text queries (selection), chooses a subset of each tweet's fields (projection), stores the tweets in MongoDB datasets, and indexes the selected tweets.
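At the level of a single tweet record, selection and projection can be pictured with the following minimal Java sketch, which uses the Jackson JSON library; the class and method names are hypothetical, while id, text, and created_at are standard fields of the Twitter JSON format.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class SelectProjectSketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Selection: keep the tweet only if its text contains the keyword.
        // Projection: retain only the requested fields.
        static ObjectNode selectAndProject(String jsonLine, String keyword) throws Exception {
            ObjectNode tweet = (ObjectNode) MAPPER.readTree(jsonLine);
            String text = tweet.path("text").asText("");
            if (!text.toLowerCase().contains(keyword.toLowerCase())) {
                return null; // tweet fails the selection predicate
            }
            return tweet.retain("id", "text", "created_at"); // projection
        }
    }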

TWIPA Services

TWIPA provides the following services:

  1. Hiding the details of the Twitter repository from users and providing an easy-to-use GUI to create datasets of tweets from this repository.
  2. Allowing users to pose rich spatio-temporal queries on the repository. In addition, keyword-based queries are provided to retrieve the tweets containing at least one of the query keywords. For example, a user can obtain tweets originating from a certain location, posted during a chosen time interval, and containing a specific hashtag.
  3. Enabling the selection of the desired fields from a tweet record. That is, the user does not have to migrate a tweet's entire set of fields into MongoDB.
  4. Storing the tweets satisfying the user query in a MongoDB dataset. MongoDB is well suited to retain large volumes of semi-structured data and to provide fast access to its content.
  5. Indexing the dataset based on tweet ID, posting time, and/or geo-location (a sketch of such index creation follows this list).
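For illustration, the three index types of item 5 could be created with the MongoDB Java driver roughly as follows; this is only a sketch, and the database, collection, and field names are assumptions rather than TWIPA's actual identifiers:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class IndexSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> tweets =
                    client.getDatabase("twipa").getCollection("myDataset");
                tweets.createIndex(Indexes.ascending("id"));            // tweet ID
                tweets.createIndex(Indexes.ascending("created_at"));    // posting time
                tweets.createIndex(Indexes.geo2dsphere("coordinates")); // geo-location (GeoJSON)
            }
        }
    }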

System Overview

As can be seen in the figure below, the Twitter repository storing the crawled tweets consists of a number of sources, each of which represents a folder containing several Twitter files. These sources can be distributed among multiple machines (servers) in order to utilize the available disk space of any machine in the local network. Each source holds tweets originating from a certain bounding box that covers a part of the geographic space. Therefore, each source is associated with meta-data indicating the address of the source's server, the path within the server, and the source's geographic scope. The tweets in each source are split into text files organized on a daily basis, i.e., each file corresponds to one day and is named following the format yyyymmdd.json, or yyyymmdd.json.gz if compressed. The fields composing a retrieved tweet are JSON-encoded. Therefore, a typical text file of tweets contains a large number of records, such that each record corresponds to one tweet, occupies a separate line, and is a JSON object.
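To make this file layout concrete, here is a minimal Java sketch (not TWIPA's actual code) that streams one daily file, compressed or not, line by line, with each line holding one JSON-encoded tweet record:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class DailyFileReaderSketch {
        // Opens e.g. "20141015.json" or "20141015.json.gz"; one tweet per line.
        static BufferedReader open(File dailyFile) throws IOException {
            InputStream in = new FileInputStream(dailyFile);
            if (dailyFile.getName().endsWith(".gz")) {
                in = new GZIPInputStream(in); // transparently uncompress
            }
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        }

        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = open(new File(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // each line holds exactly one JSON-encoded tweet record
                }
            }
        }
    }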

Recall that the main objective of TWIPA is to select a subset of the tweets from the repository, project them by choosing a subset of the fields they contain, and store them in a MongoDB dataset. It is noteworthy that TWIPA can be configured to maintain the (address/port) information of multiple MongoDB instances. Each instance refers to one MongoDB process running on a certain machine in the local network (or even outside it). This is important for splitting the disk space required for the created datasets among several machines as a kind of load balancing. On the other hand, a TWIPA user might prefer maintaining the created dataset on his/her local machine to avoid network delays when accessing remote datasets.
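Purely for illustration (this is not TWIPA's actual configuration format), the (address/port) information of several instances could be held as a list of MongoDB connection strings, from which the instance chosen by the user is opened on demand:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import java.util.List;

    public class InstanceRegistrySketch {
        // Hypothetical instance list; hosts and ports are made-up examples.
        static final List<String> INSTANCES = List.of(
            "mongodb://localhost:27017",       // local instance, avoids network delays
            "mongodb://server1.local:27017",   // remote instance on another machine
            "mongodb://server2.local:27018");  // remote instance sharing the load

        static MongoClient connect(int chosenInstance) {
            return MongoClients.create(INSTANCES.get(chosenInstance));
        }
    }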

To create a dataset, TWIPA first selects the sources, and the Twitter files within each source, that match the spatio-temporal query specified by a user. Then, TWIPA accesses the respective server of each source to retrieve the tweets of the selected files. These tweets are stored in a synchronized queue, waiting to be processed by threads from the available thread pool. Each thread performs a further selection by choosing those tweets that exactly match the specified spatio-temporal (and possibly keyword-based) queries. Moreover, each thread projects the selected tweets in order to keep only the fields requested by the user. After that, the selected and projected tweets are accumulated in a buffer until a bulk insert operation transfers them to the MongoDB instance that the user has chosen.
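The following condensed Java sketch mirrors this pipeline on an architectural level; it is not TWIPA's source code, the helpers matchesQuery and project are hypothetical placeholders for the user-specified query and field selection, and flushing the final partial buffer is omitted for brevity:

    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PipelineSketch {
        static final int BUFFER_SIZE = 1000; // tweets per bulk insert
        static final int THREADS = 4;        // size of the thread pool

        static void process(BlockingQueue<String> queue,
                            MongoCollection<Document> target) {
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int i = 0; i < THREADS; i++) {
                pool.submit(() -> {
                    List<Document> buffer = new ArrayList<>();
                    try {
                        while (true) {
                            String line = queue.take();         // blocks until a tweet arrives
                            Document tweet = Document.parse(line);
                            if (!matchesQuery(tweet)) continue; // selection
                            buffer.add(project(tweet));         // projection
                            if (buffer.size() >= BUFFER_SIZE) {
                                target.insertMany(buffer);      // bulk insert into MongoDB
                                buffer.clear();
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();     // proper stopping on cancel
                    }
                });
            }
        }

        // Hypothetical placeholders for the spatio-temporal/keyword predicate
        // and the user-chosen field projection.
        static boolean matchesQuery(Document tweet) { return true; }
        static Document project(Document tweet) { return tweet; }
    }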

[Figure: overview of the TWIPA system architecture]

TWIPA Access

Before being able to access TWIPA, a user must be granted permission to connect to the TWIPA hosting server through the secure shell (ssh) command. For this type of permission, one should first obtain a (username/password) pair from our DB administration team by sending a request to dbs-admin(at)informatik.uni-heidelberg.de. This authentication information is also needed to run TWIPA.

Once done, one can run TWIPA as follows:

1. Establishing a graphic-mode-enabled ssh connection to the TWIPA server:

           ssh -X leda.ifi.uni-heidelberg.de

2. Navigating to the following directory path:

          /data2/hamed/mongoDB/TWIPA/

3. Running TWIPA:

          ./runIt.sh

Dataset Creation Using TWIPA

To create a dataset under MongoDB, follow these steps:

1. Click on the third tab, "Datasets". A list of existing (previously created) datasets is shown along with their respective meta-data. Now, click on the button "New Dataset" at the very bottom of the window.

2. Enter the dataset specifications (meta-data), which include:
   (a) Dataset name.
   (b) Bounding box: clicking on the bounding box component opens a pop-up map. Draw a rectangle over the area of interest; once the mouse button is released, the corresponding corner coordinates are automatically entered into the system.
   (c) The start and end points of the required time interval.
   (d) Filtering keywords: a comma-separated list of keywords can optionally be provided to select only tweets containing at least one of these keywords.
   (e) Indexing fields: TWIPA currently allows indexing the created dataset by tweet ID, geo-coordinates, and creation time.
   (f) MongoDB instance: chooses the MongoDB instance, i.e., the process that will maintain the created dataset. TWIPA can alternatively create a compressed file containing the selected and projected tweets when the user does not want to deal with MongoDB; in this case, choose the item (0- To File) from the MongoDB instance drop-down list.
   A sketch of how such a specification translates into a query is given after these steps.

[Figure: the dataset specification window]

3. Select the tweet fields. For example, if only the (id, text, created_at) fields are needed from each tweet, highlight these fields and press the forward button to move them to the list on the right side, as shown in the figure below.

4. Click "Finish" to move to the "Dataset Creation" window. The selected dataset specifications are summarized and listed, and at the bottom, the "Start Creation" button is shown. Once clicked, the dataset creation process starts and the button's caption changes to "Cancel". One can interrupt the creation process at any time by clicking this button again; cancelling deletes the already inserted content, properly stops the running threads, and lets the user re-enter the dataset specification if needed.

The speed of processing tweets using TWIPA depends highly on the percentage of selected tweets: the higher the percentage, the more time is needed for tweet projection, buffering, and transfer to MongoDB through the local network. Therefore, the worst-case running time arises when a bounding box covering the entire world is chosen. In this case, the time needed to process 1 million tweets is about 2.3 minutes on average, i.e., roughly 7,000 tweets per second. On the other hand, when one is interested in obtaining tweets originating from Germany, for example, the time needed to process the same number of tweets is less than 1 minute, as can be inferred from the statistics shown in the figure below.

[Figure: tweet processing-time statistics]

Contact

For further information, please contact Hamed Abdelhaq.
