Team:St Andrews/project/ethics/communication

From 2010.igem.org

Revision as of 15:21, 7 September 2010 by Jaunty (Talk | contribs)


University of St Andrews iGEM 2010

Welcome!

The Saints

Our first year at iGEM!

Communication

Premise

Real-time Internet communication is increasingly common; the so-called Facebook generation is growing up acquainted with a dizzying array of instantaneous communication methods. The inception of email was heralded as a revolution in communication, yet today email is steadily losing ground as an everyday medium. In its place, instant messaging and social network messaging have come to prominence. Combined with the vast quantities of blogs, forum posts, wikis and other forms of user-generated content, the volume of publicly accessible communication is immense. From a human practices perspective this provides a vast and frequently changing dataset which gives insight into how people communicate.

Technical Solution

Before one can reap the benefits of such a large pool of data, one must first collect it. The web stores exabytes of data, so collecting and parsing all of it is simply not an option. Fortunately this is not required: when one is interested in a set of related terms such as {synthetic biology, synbio, igem}, large portions of the web can be disregarded. Furthermore, if the aim is to gather data relating to social communication, a number of starting points quickly become apparent. Firstly, several social networks offer a fairly standard XML-based API. Secondly, virtually every so-called Web 2.0 site organises its data in some form of chronological hierarchy (be it through metadata or simply through the removal of old articles from the home page of the site).

These two features of the web allow us to devise a simple algorithm for collecting data. The algorithm starts at a number of popular hubs of discourse (such as large news sites, social networks, newspapers, journals and blogs) and proceeds through every site linked from each site, so long as a term of interest is found. Given sufficient running time, the algorithm will crawl all sites that are reachable from any of the original sites along a path of pages mentioning the term. Given a sufficiently widespread set of start sites, it can in principle cover every linked site on the web that discusses the term. This is not always required or (when using a remarkably generic search term) feasible, and thus one may wish to impose an artificial limit. The algorithm therefore makes a best attempt to acquire as large a quantity as possible of timely social communications regarding a chosen subject.
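As a minimal sketch of the term filter described above, assuming each page is already available as plain text, the check for a set of related terms could look like the following. The names `TERMS` and `mentions_topic` are illustrative only and are not taken from the project.

```python
import re

# The related search terms used as the example in the text.
TERMS = ("synthetic biology", "synbio", "igem")

# One case-insensitive pattern for all terms. The \b word boundaries
# stop a term such as "synbio" from matching inside a longer word.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in TERMS) + r")\b",
    re.IGNORECASE,
)


def mentions_topic(text: str) -> bool:
    """Return True if the text contains at least one term of interest."""
    return _PATTERN.search(text) is not None
```

A page for which `mentions_topic` returns False can be dropped without following its links, which is what lets the crawl disregard large portions of the web.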


The pseudocode of the algorithm is as follows:

crawler(searchterm):
 S = Stack {facebook, twitter, bbcnews, cnn, foxnews, guardian, times, nytimes .. myawesomeblog}
 visited = empty set
 while S is not empty and arbitrary threshold is not reached:
   crawlerparser(S, visited, searchterm)

crawlerparser(S, visited, searchterm):
 link = pop S
 if link is in visited /* avoid crawling the same page twice */
   return
 add link to visited
 if link does not contain searchterm
   return
 if link is old /* either an earlier dated result file exists or metadata identifies it as old */
   disregard
 else
   add all hyperlinks in link to S
   output link to file result-$(date)
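
The pseudocode above can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the team's implementation: the `Page` record, the `fetch` callable (which stands in for downloading and parsing a page), the `cutoff` date used to decide what counts as old, and the `max_pages` threshold are all hypothetical. Results are returned as a list instead of being written to a dated file.

```python
from datetime import date
from typing import Callable, Iterable, List, NamedTuple, Optional


class Page(NamedTuple):
    """Hypothetical record for a fetched page."""
    text: str          # page body as plain text
    published: date    # publication date, assumed to come from metadata
    links: List[str]   # outgoing hyperlinks


def crawl(
    seeds: Iterable[str],
    term: str,
    fetch: Callable[[str], Optional[Page]],
    cutoff: date,
    max_pages: int = 1000,
) -> List[str]:
    """Follow links from the seeds while pages are recent and mention the term."""
    frontier = list(seeds)   # used as a stack, like S in the pseudocode
    visited = set()          # pages already examined
    results = []
    # Stop when nothing is left to visit or the arbitrary threshold is hit.
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:
            continue         # page could not be retrieved
        if term.lower() not in page.text.lower():
            continue         # off topic: do not follow its links
        if page.published < cutoff:
            continue         # old: disregard
        results.append(url)
        frontier.extend(page.links)
    return results
```

Because `fetch` is passed in, the traversal can be tried out on a small in-memory dictionary of pages (using the dictionary's `get` method as `fetch`) before it is pointed at real sites. The `visited` set is there so that pages which link to each other are not fetched repeatedly.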