obama/osama tweets


Sonification of Twitter Messages

Background (Patrick Herron)

Late on 01 May 2011 the President of the United States, Barack Obama, announced that US military forces had killed Osama Bin Laden, the wanted terrorist and Al Qaeda leader. As with many news stories, the social network Twitter was filled for weeks with messages about the reported killing of Bin Laden.

On 06 May 2011 I launched a tweet capture system and began tracking all tweets containing "Osama." On 10 May 2011 I began tracking all tweets containing "Obama." Within a few days, before my server's hard drive filled up, I had captured at least 1.5 million tweets containing at least one of those two terms.

The following data files each contain 5,000 Twitter posts sampled from the above-described tweets that matched the terms "obama" and "osama" beginning Tue, 10 May 2011 16:00:11 GMT. The .txt files are tab-delimited. For each term, the first 10,000 tweets in English (language code = EN) beginning at UNIX timestamp 1305043211 (Tue, 10 May 2011 16:00:11 GMT) were downloaded from the YourTwapperKeeper server. A sample of 5,000 of those tweets was extracted using RapidMiner 5.1. The tweet text was subsequently rendered into a sparse matrix/vector space model. All terms that occurred fewer than 50 times or more than 5,000 times were removed, and a further manual cleaning removed unforeseen stop words. Twitter metadata and the original tweet text were also preserved for each row in the vector space model. The vector setting selected in RapidMiner was "term frequency." Term frequency is calculated in RapidMiner using the following code:

        int numTerms = wordList.size();
        double totalTermNumber = 0;
        for (float value: frequencies)
            totalTermNumber += value;

        // Create the result structure
        double[] wv = new double[numTerms];

        // If document contains at least one term
        if (totalTermNumber > 0) {
            // Create the vector
            double length = 0.0;
            for (int i = 0; i < wv.length; i++) {
                wv[i] = frequencies[i] / totalTermNumber;
                length += wv[i] * wv[i];
            }

            length = Math.sqrt(length);

            // Normalize the vector
            if (length > 0.0)
                for (int i = 0; i < wv.length; i++)
                    wv[i] = wv[i] / length;
        }
        return wv;

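As the code snippet above shows, each term's frequency in a tweet is first divided by the total number of term occurrences in that tweet, and the resulting vector is then scaled to unit length. Because the total cancels out, the result equals the raw frequency of the term divided by the square root of the sum of the squared frequencies of all terms, commonly known as the Euclidean norm. It is important to note that this Euclidean norm is calculated *before* the additional manual pruning of the data set, so a row from which terms were later pruned may no longer have unit length.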
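To check that the two-step computation in the snippet gives the same values as dividing the raw counts directly by their Euclidean norm, the following minimal sketch runs both on the same input. It is not RapidMiner code: the class name `TfNormalizationSketch`, the method names `twoStep` and `oneStep`, and the sample counts are invented for illustration.

```java
import java.util.Arrays;

class TfNormalizationSketch {

    // Two-step version mirroring the snippet: relative frequency, then unit length.
    static double[] twoStep(float[] frequencies) {
        double total = 0;
        for (float value : frequencies) total += value;
        double[] wv = new double[frequencies.length];
        if (total > 0) {
            double length = 0.0;
            for (int i = 0; i < wv.length; i++) {
                wv[i] = frequencies[i] / total;
                length += wv[i] * wv[i];
            }
            length = Math.sqrt(length);
            if (length > 0.0)
                for (int i = 0; i < wv.length; i++) wv[i] /= length;
        }
        return wv;
    }

    // One-step version: raw count divided by the Euclidean norm of the raw counts.
    static double[] oneStep(float[] frequencies) {
        double sumSquares = 0;
        for (float value : frequencies) sumSquares += (double) value * value;
        double norm = Math.sqrt(sumSquares);
        double[] wv = new double[frequencies.length];
        if (norm > 0.0)
            for (int i = 0; i < wv.length; i++) wv[i] = frequencies[i] / norm;
        return wv;
    }

    public static void main(String[] args) {
        float[] counts = {2f, 1f, 0f, 2f}; // hypothetical term counts for one tweet
        System.out.println(Arrays.toString(twoStep(counts)));
        System.out.println(Arrays.toString(oneStep(counts)));
    }
}
```

Both methods should agree up to floating-point rounding. Dropping any nonzero column from such a vector afterwards leaves the remaining values with a length below 1, which is why the order of normalization and pruning matters.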

The following is a description of the data fields. There are two types of files: Osama records and Obama records. Each file has three sections: a first set of shared fields, then the term frequency fields specific to that file's term (OSAMA or OBAMA), then a closing pair of shared fields. The shared fields, which occur both before and after the term frequency fields, are the same in both file types and are noted below.

OSAMA/OBAMA fields
------------------
time               machine-readable UNIX timestamp (UTC): seconds since 00:00:00 UTC 01-01-1970
iso_language_code
from_user_id       the unique user ID for the tweeter (required)
from_user          the twitter user name of the user who posted the tweet (required)
profile_img_url    the URL for the tweeter's profile picture
to_user_id         if the tweet (TEXT) contains an @ string, then twitter's unique userID for that user appears here (optional)
created_at         formatted/human-readable time stamp
geo_type           type of geocode
geo_coordinate_0   latitude in decimal format
geo_coordinate_1   longitude in decimal format
search terms
post

The three geocode fields should probably be ignored, as they are almost all empty.
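For readers loading the files programmatically, here is a minimal Java sketch of splitting one tab-delimited row and converting its UNIX timestamp to a readable GMT string. It assumes the leading columns appear in the order of the field list above; that order, the class name `TweetRowSketch`, and the sample row (including the user ID and user name) are assumptions for illustration and are not taken from the data files.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class TweetRowSketch {

    // Assumed column positions, following the order of the field list.
    static final int TIME = 0;
    static final int ISO_LANGUAGE_CODE = 1;
    static final int FROM_USER_ID = 2;
    static final int FROM_USER = 3;

    static final DateTimeFormatter GMT_FORMAT =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                         .withZone(ZoneOffset.UTC);

    // Split one tab-delimited row; the -1 limit keeps trailing empty fields,
    // which matters because optional fields such as the geocodes are often empty.
    static String[] splitRow(String line) {
        return line.split("\t", -1);
    }

    // Convert a UNIX timestamp in seconds to a human-readable GMT string.
    static String formatTime(String unixSeconds) {
        return GMT_FORMAT.format(Instant.ofEpochSecond(Long.parseLong(unixSeconds)));
    }

    public static void main(String[] args) {
        // Hypothetical row holding only the first four shared fields.
        String row = "1305043211\ten\t12345\texample_user";
        String[] fields = splitRow(row);
        System.out.println(formatTime(fields[TIME])); // Tue, 10 May 2011 16:00:11 GMT
        System.out.println(fields[FROM_USER]);        // example_user
    }
}
```

The timestamp used here is the starting second of both files, so the first printed line doubles as a check on the conversion.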

All tweets were marked as English (language code EN), yet not all of them actually are in English.

Both files start at the same second: Tue, 10 May 2011 16:00:11 GMT (1305043211).

Note that this time is about nine days after the announcement of the killing of Osama Bin Laden; Twitter remained rife with tweets about Osama Bin Laden and Barack Obama's role in Osama's death.

The data gathered are in accordance with the Twitter API as of 06-10 May 2011. The data format for tweets may have changed since then, so this description may no longer be consistent with the API specs as of October 2011.

Sonification strategy (Scott Lindroth)

Sonification is done with SuperCollider (SC), a software synthesis/composition language. SC can parse .csv files and allow mapping of data points to synthesis/playback parameters. Two processes are running simultaneously during sonification:

1. SC compares the term fields of each pair of users in the data set (over 12M comparisons if all 5,000 tweets are used). If two tweets share some minimum number of terms, those terms are concatenated into a string and registered in a hash that associates the string with a list of the users who tweeted those terms.

2. The second process repeatedly reads through the hash, which, for this example, is limited to the first 1000 tweets. SC queries each key (a string of terms) to determine how many users are associated with that string. The following mappings are used:

Three instruments are assigned to different ranges in the pitch gamut: a bass guitar in the lowest register, something like a vibraphone in the middle register, and a square wave in the highest register. A fourth instrument, granular synthesized voices, is reserved for special conditions during sonification.
Each string of terms is associated with a particular pitch, starting low and ascending by step.
The number of repetitions of the pitch associated with a string of search terms is determined by the number of users associated with the string.
Voices and chords occur when more than 4 users are associated with a particular string of terms.

The pitches consist of a mixolydian scale projected over several octaves.
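The mapping from term strings to pitches can be sketched outside SuperCollider. The Java fragment below is an illustration under assumptions, not the SC code used for the piece: it represents pitches as MIDI note numbers, starts the scale on a hypothetical base note (36), and uses invented names (`PitchMappingSketch`, `pitchForIndex`, and so on). What it takes from the description is the mixolydian step pattern, the stepwise ascent across octaves, one repetition per associated user, and the more-than-4-users condition.

```java
class PitchMappingSketch {

    // Semitone offsets of the mixolydian scale within one octave.
    static final int[] MIXOLYDIAN = {0, 2, 4, 5, 7, 9, 10};

    // Hypothetical lowest pitch as a MIDI note number (the text names no specific note).
    static final int BASE_NOTE = 36;

    // The n-th term string (0-based) gets the n-th scale step above the base note.
    static int pitchForIndex(int index) {
        int octave = index / MIXOLYDIAN.length;
        int degree = index % MIXOLYDIAN.length;
        return BASE_NOTE + 12 * octave + MIXOLYDIAN[degree];
    }

    // One repetition of the pitch per user associated with the string.
    static int repetitions(int userCount) {
        return userCount;
    }

    // Voices and chords occur when more than 4 users share a string of terms.
    static boolean triggersVoicesAndChords(int userCount) {
        return userCount > 4;
    }

    public static void main(String[] args) {
        // Print the pitches for the first nine term strings.
        for (int i = 0; i < 9; i++) {
            System.out.print(pitchForIndex(i) + " ");
        }
        System.out.println();
    }
}
```

Assigning each register to one of the three instruments would need pitch thresholds that the description does not give, so the sketch leaves that step out.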

As these processes unfold, SC prints the string of terms (---> < string > ) followed by the list of users associated with that string. Note that the strings of terms do not comprise the full posts. The individual terms in each string are presented in alphabetical order. What you hear through sonification is the growth of communities around particular tweets over time. The more repetitions of a note, the larger the community. The diversity of tweets is represented by more elaborate pitch sequences.
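The first process, the pairwise comparison that fills the hash, can be sketched in Java as follows. This is a hedged illustration rather than a port of the SuperCollider code: the `Tweet` class, the `MIN_SHARED` threshold of 2, the space-separated key format, and the sample users and terms are all assumptions. The sketch compares every pair of tweets, puts the shared terms in alphabetical order, joins them into a key, and records both users under that key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SharedTermHashSketch {

    static final int MIN_SHARED = 2; // hypothetical minimum number of shared terms

    static class Tweet {
        final String user;
        final Set<String> terms;
        Tweet(String user, String... terms) {
            this.user = user;
            this.terms = new TreeSet<>(Arrays.asList(terms));
        }
    }

    // Number of pairwise comparisons for n tweets: n * (n - 1) / 2.
    static long comparisons(int n) {
        return (long) n * (n - 1) / 2;
    }

    static Map<String, Set<String>> build(List<Tweet> tweets) {
        Map<String, Set<String>> hash = new LinkedHashMap<>();
        for (int i = 0; i < tweets.size(); i++) {
            for (int j = i + 1; j < tweets.size(); j++) {
                // A TreeSet keeps the shared terms in alphabetical order.
                Set<String> shared = new TreeSet<>(tweets.get(i).terms);
                shared.retainAll(tweets.get(j).terms);
                if (shared.size() < MIN_SHARED) continue;
                String key = String.join(" ", shared);
                Set<String> users = hash.computeIfAbsent(key, k -> new LinkedHashSet<>());
                users.add(tweets.get(i).user);
                users.add(tweets.get(j).user);
            }
        }
        return hash;
    }

    public static void main(String[] args) {
        // Invented users and terms, for illustration only.
        List<Tweet> tweets = new ArrayList<>();
        tweets.add(new Tweet("user_a", "bin", "laden", "dead"));
        tweets.add(new Tweet("user_b", "laden", "bin", "photo"));
        tweets.add(new Tweet("user_c", "dead", "bin", "laden"));
        for (Map.Entry<String, Set<String>> e : build(tweets).entrySet()) {
            System.out.println("---> < " + e.getKey() + " > " + e.getValue());
        }
    }
}
```

With 5,000 tweets, `comparisons(5000)` gives 12,497,500 pairs, which matches the "over 12M comparisons" figure given for the first process.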

It is easy to hear the differences between the two datasets ("osama" and "obama"). The former is almost solely focused on the death of bin Laden, which results in large communities associated with particular tweets. Obama, on the other hand, has many more twitter topics, including health care, immigration, a commencement address, and, of course, the death of bin Laden. The resulting performance is less stable (i.e., the music jumps from note to note more quickly) because the communities associated with each tweet are smaller.

Osama dataset

Osama Tweets (new)
You may have to resize this to fit the screen (command "-" on a Mac)

Osama Tweets

Same algorithm/sound design applied to the "obama" dataset:

Obama Tweets