Team:Paris Liliane Bettencourt/Project/SIP

From 2010.igem.org


Latest revision as of 23:07, 27 October 2010

SIP Wiki Analyser

A new software tool to analyse iGEM projects

This year, iGEM team Paris has made a piece of software to analyse the word content of iGEM wikis. We initially wanted to use this software to predict who would win iGEM, but it turns out to be useful for lots of other fun things as well.
To analyse the wikis, we have implemented an algorithm that uses Statistically Improbable Phrases (SIPs) to try to extract the meaning of a text, an approach that Amazon has used recently to find relationships between books. An SIP is a word or phrase that appears with a higher frequency in your target (whether it is a book or an iGEM wiki) than in a large dataset of background text. For example, a book about detectives might have "tweed jacket" or "corncob pipe" as SIPs because these phrases appear often in the book but are not common in the general corpus.

More formally:

f is the frequency of the word in the target: f = o/n (o is the number of occurrences of the word in the wiki, and n is the total number of words in the wiki).
F is the frequency of the same word in a large sample: F = O/N (O is the number of occurrences in the sample, and N is the total number of words in the sample).
f/F is the "improbability factor" of a given SIP candidate: the larger it is, the more over-represented the word is in the target.
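
As an illustration of the definitions above, here is a minimal Python sketch that computes f/F for every word of a target text. It is not our actual code: the function name improbability_factors and the toy word lists are made up for this example, and we assume the background sample contains the target text (as it does when all wikis are pooled together).

```python
from collections import Counter


def improbability_factors(target_words, background_words):
    """Return {word: f/F} for every word that occurs in the target."""
    target_counts = Counter(target_words)          # o for each word
    background_counts = Counter(background_words)  # O for each word
    n = len(target_words)                          # words in the target wiki
    N = len(background_words)                      # words in the background sample

    factors = {}
    for word, o in target_counts.items():
        O = background_counts[word]
        if O == 0:
            # Assumption: skip words missing from the background,
            # because F = 0 would make f/F undefined.
            continue
        f = o / n
        F = O / N
        factors[word] = f / F
    return factors


# Toy data (hypothetical): the background includes the target's words.
target = "biofilm sensor biofilm clone".split()
background = "clone miniprep clone biofilm igem clone sensor biofilm".split()
factors = improbability_factors(target, background)

# Print candidates from most to least improbable.
for word, factor in sorted(factors.items(), key=lambda item: -item[1]):
    print(f"{word}: {factor:.2f}")
```

In this toy example "biofilm" gets a factor of 2 (f = 2/4, F = 2/8), while the generic word "clone" falls below 1 and would be discarded as an SIP candidate.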

Our algorithm uses all iGEM wikis as a background dataset, and then tries to find SIPs for each individual wiki. By using the text of all the wikis put together as a background, we are able to discard words that would be SIPs in a more general sense (clone, miniprep, igem) and instead focus on the words that make each project unique.

Our software's workflow is essentially as follows. We use wget to retrieve all of the wiki pages from the iGEM server. We strip all non-alphanumeric characters and convert all remaining text to lower case. Finally, we compute the relative frequency of each SIP candidate in an sqlite3 database. We have extracted and calculated this for the last 3 years of iGEM; the data is available under the Download tab, for you to play with if you like.
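
The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual code: the page text is assumed to be already downloaded (the wget step is left out), non-alphanumeric characters are treated as word separators, and the table name counts, the column names and the functions tokenize, build_counts and top_sips are all hypothetical.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it on non-alphanumeric characters."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


def build_counts(pages):
    """Store per-wiki word counts from {wiki_name: raw_text} in sqlite3."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE counts (wiki TEXT, word TEXT, o INTEGER)")
    for wiki, text in pages.items():
        word_counts = Counter(tokenize(text))
        db.executemany(
            "INSERT INTO counts (wiki, word, o) VALUES (?, ?, ?)",
            [(wiki, word, o) for word, o in word_counts.items()],
        )
    db.commit()
    return db


def top_sips(db, wiki, limit=10):
    """Return (word, f/F) rows for one wiki, most improbable first."""
    query = """
        SELECT c.word,
               (c.o * 1.0 / t.wiki_total)
                   / (b.word_total * 1.0 / g.grand_total) AS factor
        FROM counts AS c
        JOIN (SELECT wiki, SUM(o) AS wiki_total FROM counts GROUP BY wiki) AS t
          ON t.wiki = c.wiki
        JOIN (SELECT word, SUM(o) AS word_total FROM counts GROUP BY word) AS b
          ON b.word = c.word
        CROSS JOIN (SELECT SUM(o) AS grand_total FROM counts) AS g
        WHERE c.wiki = ?
        ORDER BY factor DESC, c.word
        LIMIT ?
    """
    return db.execute(query, (wiki, limit)).fetchall()


# Toy pages (hypothetical): all wikis together form the background.
pages = {
    "team_a": "Biofilm sensor: biofilm clone!",
    "team_b": "Clone, miniprep; clone iGEM.",
}
db = build_counts(pages)
rows = top_sips(db, "team_a")
for word, factor in rows:
    print(word, round(factor, 2))
```

Keeping the raw counts in the database and computing f/F in a query means the background totals are always taken over every wiki that has been loaded, so adding another year of wikis only requires inserting more rows.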

We have used a word-cloud app called [http://www.wordle.net/create Wordle] to visualize our SIP results. Some of the graphics we made can be viewed under the Results tab!

In the future, we intend to make use of the ENTREZ API to get SIP data on each instructor's research output, and then compare the SIPs of instructors to those of their team to get an idea of how an advisor's research path influences their team's work. We are interested to see how teams that do projects wildly divergent from their "mother-labs" do in iGEM.
One other idea we had was to try to predict the future winner of the iGEM competition using our software. The idea is to find SIPs for all the previous years' winners and give each of them a coefficient that depends on the prize the project received. Once this is done, each wiki can be compared with this database of "winning" SIPs. Will this comparison give clues about the future winner? We think it will be interesting to find out, but it still has to be implemented.
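
Since this part is not implemented yet, the following Python sketch shows only one possible reading of the idea. Everything in it is an assumption made for illustration: the prize coefficients, the rule of summing coefficients per SIP, the function names build_winning_sips and winning_score, and the toy SIP sets.

```python
from collections import defaultdict


def build_winning_sips(winners):
    """Build {sip: weight} from (prize_coefficient, set_of_sips) pairs."""
    weights = defaultdict(float)
    for coefficient, sips in winners:
        for sip in sips:
            # Assumption: a SIP shared by several winners accumulates weight.
            weights[sip] += coefficient
    return dict(weights)


def winning_score(wiki_sips, winning_sips):
    """Sum the weights of a wiki's SIPs that appear in the winning database."""
    return sum(winning_sips.get(sip, 0.0) for sip in wiki_sips)


# Made-up data: coefficients and SIPs are purely illustrative.
past_winners = [
    (3.0, {"biofilm", "oscillator"}),  # e.g. a top prize
    (1.0, {"biofilm", "arsenic"}),     # e.g. a smaller prize
]
winning = build_winning_sips(past_winners)
score_a = winning_score({"biofilm", "sensor"}, winning)
score_b = winning_score({"arsenic"}, winning)
print(score_a, score_b)
```

A higher score only means that a wiki shares more heavily weighted vocabulary with past winners; whether that has any predictive value is exactly the open question.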

We very much hope that other teams will make use of the data we've generated to run some of their own analyses!

References

http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
http://petewarden.typepad.com/searchbrowser/2008/01/whats-the-secre.html
