Team:St Andrews/project/ethics/communication
From 2010.igem.org
(→Technical Solution) |
(→Implementation) |
||
(38 intermediate revisions not shown) | |||
Line 1: | Line 1: | ||
{{:Team:St_Andrews/defaulttemplate}} | {{:Team:St_Andrews/defaulttemplate}} | ||
- | <html><h1> | + | <html><h1>Communication: Global</h1></html> |
=Premise= | =Premise= | ||
- | Realtime Internet communication is | + | Realtime Internet communication is increasingly common, the so called facebook generation are growing up acquainted with a dizzying array of instantaneous comunication methods. The inception of email was heralded as a revolution in communication, today the quantity of email traffic is at an all time low. In place of email instant messaging and social network messaging have come to precidence. Combined with the vast quantities of blogs, forum posts, wikis and other forms of user generated content the volume of publically accessible communications is immense. From a human practices perspective this provides a vast and frequently changing dataset which gives insight into how people communicate. |
+ | |||
+ | Given a moderately controversial topic such as synthetic biology this allows us to investigate a number of different metrics and trends relating to its appearance in communications. By gathering communications over a period of time we are able to investigate the following phenomena: | ||
+ | |||
+ | * How popular is synthetic biology in relation to search terms associated with popular culture (football, lady gaga, politics etc)? | ||
+ | * How popular is synthetic biology in relation to traditional sciences? | ||
+ | * How popular is synthetic biology in relation to other domain specific sciences? | ||
+ | |||
+ | If the collected data is parsed using a natural language analysis library such as the excellent [http://www.cs.pitt.edu/mpqa/opinionfinderrelease/ OpinionFinder] from the University of Pittsburgh one can draw even greater conclusions. Although the use of natural language analysis software does not guarantee accurate analysis of language it provides a good attempt. Identifying "Synthetic biology is good" as positive and "synthetic biology is evil" as negative is not a challenge, however "systems biology is excellent, unlike synthetic biology" and "synthetic biology is good at being bad" pose a greater challenge. However given training and the methodical checking of selective sets of output lessens the risk of spurious output. We are able to acquire a set of numerical figures representing the overall opinion of each collection of data. Given this we are able to analyse data gathered over a period of time to investigate the following: | ||
+ | |||
+ | * The standard deviation of opinion of synthetic biology, is opinion consistent or frequently changing? | ||
+ | * The standard deviation of opinion of synthetic biology compared to the standard deviation of opinion of traditional science, how do the consistence of opinions differ? | ||
+ | * The standard deviation of opinion of synthetic biology compared to the standard deviation of opinion of elements of popular culture, how do the consistencies of opinions differ? | ||
=Technical Solution= | =Technical Solution= | ||
- | Before one can reap the benefits of having access to such a great pool of data one must answer the challenge of collecting this data. The web stores exobytes of data hence collecting and parsing the entirety of the available data is simply not an option. However this is not required, when interested in a set of related terms such as {synthetic biology, synbio, igem} one can disregard large portions of the web. Furthermore if one is considering gathering data relating to social | + | ==Theory== |
+ | Before one can reap the benefits of having access to such a great pool of data one must answer the challenge of collecting this data. The web stores exobytes of data, hence collecting and parsing the entirety of the available data is simply not an option. However this is not required, when interested in a set of related terms such as {synthetic biology, synbio, igem} one can disregard large portions of the web. Furthermore if one is considering gathering data relating to social communication, then a number of start points quickly become apparent. Firstly serveral social networks offer a fairly standard XML based API and secondly virtually every so called web 2.0 site organises data via some form of chronological hierarchy (be it through metadata or simply via the removal of old articles from the home page of the site). These two features of the web allow for us to deduce a simple algorithm for collecting data. This algorithm would start at a number of popular hubs of discourse (such as large news sites, social networks, newspapers, journals, blogs etc) and procede to continue through every site linked from each site so long as a term of interest is found. Given sufficient running time this algorithm will crawl through all sites which have a path from any of the original sites. Given a sufficiently widespread set of start sites, this algorithm will encompass all sites on the web. This is not always required or (when using a remarkably generic search term) feasible and thus one may wish to impose an artificial limitation. Thus this algorithm will perform a best attempt to acquire as large a quantity of timely social communications regarding a chosen subject. | ||
Line 24: | Line 37: | ||
else | else | ||
add all hyperlinks in result to S | add all hyperlinks in result to S | ||
- | output | + | output result-$(date) |
+ | |||
+ | This algorithm will output list of phrases extracted from the web pertaining to the search phrase. This output is then redirected to a file and serves as the input for OpinionFinder to analyse. | ||
+ | |||
+ | The following is a visualisation of the crawler algorithm: | ||
+ | [[Image:StAndrews Crawler.png | 800px]] | ||
+ | |||
+ | ==Implementation== | ||
+ | [https://static.igem.org/mediawiki/2010/0/08/StAndrews_Crawler.txt Python Implementation of the Algorithm] | ||
+ | |||
+ | =Results= | ||
+ | The following graphs are the processed output of the crawler program. The collected data was parsed and analysed. There are two types of result shown, one showing the opinion as to a subject. This is drawn from the output of OpinionFinder which examined the collected internet data. Opinion is ranked in terms of the language used to describe search terms, it is ranked from 0 to 100. Where 0 is extremely dissmisve towards a term and 100 is extremely positive towards a term. The second result shows the amount of discussion relating to a collected search term. This figure represents what percentage of the overall content crawled as part of the searching for the provided terms relate to each term. | ||
+ | == The Opinion of Synthetic Biology and the Opinion of Genetic Engineering Over Time== | ||
+ | The initial peak in synthetic biology is likely due to the resultant media following the sucesss of the J Craig Venter Institute in May 2010. The language used by more conventional science media is far more convervative and ranks lower thus the peak could be due to more sensationalist media coverage. This subsides after 2 weeks. Genetic Engineering alternatively is relatively constant and barely fluctuates, likes due to the lack of media coverage at the time. | ||
+ | [[Image:StAndrews_sb-gen.png | 800px]] | ||
+ | == The Opinion as to Popular Sciences and Synthetic Biology Over Time == | ||
+ | The most notable phenomena within this result is the great variation in synthetic biology when compared to tje other sciences. The three traditional sciences deviate very little, this is predomenantly due to the language used in scienctific media being relatively unbiased. Computer Science is discussed frequently in non scientific contexts and thus is subject to a wide range of dictions which influence the result. The reason for Mathematics low score is non obvious and may too be due to discussion in a non measured forum or may be due to the subject being generally discussed less favorably. | ||
+ | |||
+ | [[Image:Pop sciences.png | 800px]] | ||
+ | |||
+ | == The Amount of Discussion of Sciences Over Time== | ||
+ | Once again synthetic biology is noteable. This time for being discussed orders of magnitude less frequently than more traditional science. Thus, while synthetic biology may be subject to ranges of massively differing opinions it is not subject to massive amounts of conversation as per the less sensationalist sciences. This is also in part due to the names of the three traditional sciences being used often in general discussion. | ||
+ | [[Image:St_Andrews_amount_discussion.png | 800px]] | ||
+ | == The Opinion of Various Popular Culture Elements and Synthetic Biology Over Time== | ||
+ | This result provides a frame of reference for synthetic biology as to how it pertains to other more commonly discussed elements. Interestingly it compares favorably as synthetic biology is seldom discussed in the remarkably polemic manner in which popular culture is discussed. | ||
+ | [[Image:StAndrews others.png | 800px]] | ||
+ | |||
+ | == The Amount of Discussion of Popular Culture Elements and Synthetic Biology over Time == | ||
+ | Politics, footbal and Lady Gaga occur more often as Synthetic Biology. Our crawled text is sufficiently general for a comparison. It is clearly shown that popular culture elements are discussed far more regularly than synthetic biology. | ||
+ | [[Image:St Andrews Amount-other.png | 800px]] | ||
+ | |||
+ | =Conclusions= | ||
+ | In the spirt of iGEM this methodology presents a technological solution to a real world problem. Traditional surveys are limited in their capacity. They are extremely limited by their range. Even surveys distributed over the web or manually in multiple countries touch but a tiny fraction of the worlds population. Traditional data gathering techniques are also invasive and do not gurantee genuine responces. The solution attempts to rectifiy these problems by sampling the vast range of web accessible content on a diverse collection of subjects. In doing so it is possible to gather up to date data from a vast range of sources, far beyond the reach of more traditional techniques. It too is easily repeatable and does not require manpower or participants to run a traditional survey. It is not however without it's drawbacks. The computational analysis of language is far from perfect and can lead of inacurate conclusions. Additionally a signficant drawback is that this methodology only aquires data from persons already aquainted with the subject matter and cannot seek information from uninitiated parties. For this purpose more traditional data collection techniques are prefered. | ||
+ | |||
+ | The results prsented here are a proof of concept. We show how human practices can be investigated technologically and how iGEM can gain a wider view of the world. The results displayed are a fraction of the possibilites that our methodology could be applied to. Time, storage and compute time limited what could be done however we present a valid first step towards investigating human practices in a more traditionally scientific manner. The program we provide is capable of resolving the country of origin of a web page and in some cases (the case of blogs) the age of page. Future work could examine geographic spread of opinion and consider historical data. The best application of this methodology is in the augmentation of traditional survey methods. Through combining both sociological methods and technical methods it is possible to capture an in depth perspective of the world as it pertains to a subject to serve as an excellent building block for a human practices investigation. |
Latest revision as of 01:22, 28 October 2010
Communication: Global
Premise
Realtime Internet communication is increasingly common, the so called facebook generation are growing up acquainted with a dizzying array of instantaneous comunication methods. The inception of email was heralded as a revolution in communication, today the quantity of email traffic is at an all time low. In place of email instant messaging and social network messaging have come to precidence. Combined with the vast quantities of blogs, forum posts, wikis and other forms of user generated content the volume of publically accessible communications is immense. From a human practices perspective this provides a vast and frequently changing dataset which gives insight into how people communicate.
Given a moderately controversial topic such as synthetic biology this allows us to investigate a number of different metrics and trends relating to its appearance in communications. By gathering communications over a period of time we are able to investigate the following phenomena:
- How popular is synthetic biology in relation to search terms associated with popular culture (football, lady gaga, politics etc)?
- How popular is synthetic biology in relation to traditional sciences?
- How popular is synthetic biology in relation to other domain specific sciences?
If the collected data is parsed using a natural language analysis library such as the excellent [http://www.cs.pitt.edu/mpqa/opinionfinderrelease/ OpinionFinder] from the University of Pittsburgh one can draw even greater conclusions. Although the use of natural language analysis software does not guarantee accurate analysis of language it provides a good attempt. Identifying "Synthetic biology is good" as positive and "synthetic biology is evil" as negative is not a challenge, however "systems biology is excellent, unlike synthetic biology" and "synthetic biology is good at being bad" pose a greater challenge. However given training and the methodical checking of selective sets of output lessens the risk of spurious output. We are able to acquire a set of numerical figures representing the overall opinion of each collection of data. Given this we are able to analyse data gathered over a period of time to investigate the following:
- The standard deviation of opinion of synthetic biology, is opinion consistent or frequently changing?
- The standard deviation of opinion of synthetic biology compared to the standard deviation of opinion of traditional science, how do the consistence of opinions differ?
- The standard deviation of opinion of synthetic biology compared to the standard deviation of opinion of elements of popular culture, how do the consistencies of opinions differ?
Technical Solution
Theory
Before one can reap the benefits of having access to such a great pool of data one must answer the challenge of collecting this data. The web stores exobytes of data, hence collecting and parsing the entirety of the available data is simply not an option. However this is not required, when interested in a set of related terms such as {synthetic biology, synbio, igem} one can disregard large portions of the web. Furthermore if one is considering gathering data relating to social communication, then a number of start points quickly become apparent. Firstly serveral social networks offer a fairly standard XML based API and secondly virtually every so called web 2.0 site organises data via some form of chronological hierarchy (be it through metadata or simply via the removal of old articles from the home page of the site). These two features of the web allow for us to deduce a simple algorithm for collecting data. This algorithm would start at a number of popular hubs of discourse (such as large news sites, social networks, newspapers, journals, blogs etc) and procede to continue through every site linked from each site so long as a term of interest is found. Given sufficient running time this algorithm will crawl through all sites which have a path from any of the original sites. Given a sufficiently widespread set of start sites, this algorithm will encompass all sites on the web. This is not always required or (when using a remarkably generic search term) feasible and thus one may wish to impose an artificial limitation. Thus this algorithm will perform a best attempt to acquire as large a quantity of timely social communications regarding a chosen subject.
The pseudo code of the algorithm is as follows:
crawler(searchterm): links = Stack S S = {facebook, twitter, bbcnews, cnn, foxnews, guardian, times, nytimes .. myawesomeblog} while S is not empty or arbritary threshold: crawlerparser(S, searchterm)
crawlerparser(links, searchterm): for each link in links results = results += link containing searchterm for each result in results if result is old /* either is result an earlier dated result file or metadata identifies as old */ disregard else add all hyperlinks in result to S output result-$(date)
This algorithm will output list of phrases extracted from the web pertaining to the search phrase. This output is then redirected to a file and serves as the input for OpinionFinder to analyse.
The following is a visualisation of the crawler algorithm:
Implementation
Python Implementation of the Algorithm
Results
The following graphs are the processed output of the crawler program. The collected data was parsed and analysed. There are two types of result shown, one showing the opinion as to a subject. This is drawn from the output of OpinionFinder which examined the collected internet data. Opinion is ranked in terms of the language used to describe search terms, it is ranked from 0 to 100. Where 0 is extremely dissmisve towards a term and 100 is extremely positive towards a term. The second result shows the amount of discussion relating to a collected search term. This figure represents what percentage of the overall content crawled as part of the searching for the provided terms relate to each term.
The Opinion of Synthetic Biology and the Opinion of Genetic Engineering Over Time
The initial peak in synthetic biology is likely due to the resultant media following the sucesss of the J Craig Venter Institute in May 2010. The language used by more conventional science media is far more convervative and ranks lower thus the peak could be due to more sensationalist media coverage. This subsides after 2 weeks. Genetic Engineering alternatively is relatively constant and barely fluctuates, likes due to the lack of media coverage at the time.
The Opinion as to Popular Sciences and Synthetic Biology Over Time
The most notable phenomena within this result is the great variation in synthetic biology when compared to tje other sciences. The three traditional sciences deviate very little, this is predomenantly due to the language used in scienctific media being relatively unbiased. Computer Science is discussed frequently in non scientific contexts and thus is subject to a wide range of dictions which influence the result. The reason for Mathematics low score is non obvious and may too be due to discussion in a non measured forum or may be due to the subject being generally discussed less favorably.
The Amount of Discussion of Sciences Over Time
Once again synthetic biology is noteable. This time for being discussed orders of magnitude less frequently than more traditional science. Thus, while synthetic biology may be subject to ranges of massively differing opinions it is not subject to massive amounts of conversation as per the less sensationalist sciences. This is also in part due to the names of the three traditional sciences being used often in general discussion.
The Opinion of Various Popular Culture Elements and Synthetic Biology Over Time
This result provides a frame of reference for synthetic biology as to how it pertains to other more commonly discussed elements. Interestingly it compares favorably as synthetic biology is seldom discussed in the remarkably polemic manner in which popular culture is discussed.
The Amount of Discussion of Popular Culture Elements and Synthetic Biology over Time
Politics, footbal and Lady Gaga occur more often as Synthetic Biology. Our crawled text is sufficiently general for a comparison. It is clearly shown that popular culture elements are discussed far more regularly than synthetic biology.
Conclusions
In the spirt of iGEM this methodology presents a technological solution to a real world problem. Traditional surveys are limited in their capacity. They are extremely limited by their range. Even surveys distributed over the web or manually in multiple countries touch but a tiny fraction of the worlds population. Traditional data gathering techniques are also invasive and do not gurantee genuine responces. The solution attempts to rectifiy these problems by sampling the vast range of web accessible content on a diverse collection of subjects. In doing so it is possible to gather up to date data from a vast range of sources, far beyond the reach of more traditional techniques. It too is easily repeatable and does not require manpower or participants to run a traditional survey. It is not however without it's drawbacks. The computational analysis of language is far from perfect and can lead of inacurate conclusions. Additionally a signficant drawback is that this methodology only aquires data from persons already aquainted with the subject matter and cannot seek information from uninitiated parties. For this purpose more traditional data collection techniques are prefered.
The results prsented here are a proof of concept. We show how human practices can be investigated technologically and how iGEM can gain a wider view of the world. The results displayed are a fraction of the possibilites that our methodology could be applied to. Time, storage and compute time limited what could be done however we present a valid first step towards investigating human practices in a more traditionally scientific manner. The program we provide is capable of resolving the country of origin of a web page and in some cases (the case of blogs) the age of page. Future work could examine geographic spread of opinion and consider historical data. The best application of this methodology is in the augmentation of traditional survey methods. Through combining both sociological methods and technical methods it is possible to capture an in depth perspective of the world as it pertains to a subject to serve as an excellent building block for a human practices investigation.