Team:Paris Liliane Bettencourt/Project/SIP

From 2010.igem.org


Revision as of 17:02, 24 October 2010



SIP Wiki Analyser





Find the iGEM winner with statistics


This year, the iGEM Paris team will try to predict who will win the iGEM competition. We have several approaches to this, all using data from the team wikis.
To analyse the wikis, we have implemented an algorithm called SIP (Statistically Improbable Phrases), which the Amazon website uses to characterize its books. We look for the words that are the most improbable in a large sample (we use all iGEM wikis) but the most probable in a given context (one team's wiki). In this way we can find words that are original to particular teams, and also characterize each wiki by its specific words or terms.

In mathematical terms:

  • We score each word by the ratio f/F; the higher the ratio, the more characteristic the word is of the wiki.
  • f is the frequency of the word in the context: f = o/n (o is the number of occurrences of the word in the wiki, and n the number of words in the wiki).
  • F is the frequency of the same word in a large sample: F = O/N (O is the number of occurrences in the sample, and N the number of words in the sample).
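
The ratio f/F can be sketched in a few lines of Python. This is only a minimal illustration, not our actual code: the function name `sip_scores` and its parameters are hypothetical, and we assume each text has already been split into a list of lower-case words. Words that never appear in the sample are skipped, which is also an assumption (in practice the sample contains the wiki itself, so O is never zero).

```python
from collections import Counter


def sip_scores(wiki_words, sample_words, top=10):
    """Rank the words of one wiki by f/F, with f = o/n and F = O/N."""
    local = Counter(wiki_words)     # o: occurrences of each word in the wiki
    sample = Counter(sample_words)  # O: occurrences of each word in the sample
    n = len(wiki_words)             # number of words in the wiki
    N = len(sample_words)           # number of words in the sample

    scores = {}
    for word, o in local.items():
        O = sample[word]
        if O == 0:
            # Assumption: a word missing from the sample cannot be scored.
            continue
        f = o / n
        F = O / N
        scores[word] = f / F

    # Highest ratio first: frequent in this wiki, rare in the sample.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]
```

Calling `sip_scores(words_of_one_team, words_of_all_teams)` returns a list of `(word, ratio)` pairs, with the most characteristic words of the team first.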

In our code, we use the wget software to download all the wikis of one year, then we remove all non-alphanumeric characters and convert the text to lower case. Next, we compute the frequency of each word in a sqlite3 database.
We have extracted and calculated this for the last 3 years; the results are available at the bottom of the page.
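
The cleaning and counting steps described above could look roughly like the following Python sketch. It is not our original code: the table layout (`counts` with columns `team`, `word`, `n`) and the function names are hypothetical, the download step with wget is left out (the text is assumed to be already on disk or in memory), and we assume that non-alphanumeric characters are replaced by spaces so that word boundaries are kept.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Drop non-alphanumeric characters, lower-case, and split into words."""
    # Assumption: punctuation becomes a space so neighbouring words stay apart.
    return re.sub(r"[^0-9A-Za-z]+", " ", text).lower().split()


def store_counts(conn, team, text):
    """Store the word counts of one wiki (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (team TEXT, word TEXT, n INTEGER)"
    )
    counts = Counter(tokenize(text))
    conn.executemany(
        "INSERT INTO counts (team, word, n) VALUES (?, ?, ?)",
        [(team, word, n) for word, n in counts.items()],
    )
    conn.commit()


def frequency(conn, word, team=None):
    """Return occurrences / total words, for one team or for the whole sample."""
    if team is None:
        total = conn.execute("SELECT SUM(n) FROM counts").fetchone()[0]
        hits = conn.execute(
            "SELECT SUM(n) FROM counts WHERE word = ?", (word,)
        ).fetchone()[0]
    else:
        total = conn.execute(
            "SELECT SUM(n) FROM counts WHERE team = ?", (team,)
        ).fetchone()[0]
        hits = conn.execute(
            "SELECT SUM(n) FROM counts WHERE team = ? AND word = ?",
            (team, word),
        ).fetchone()[0]
    if not total:
        return 0.0
    return (hits or 0) / total
```

With a connection from `sqlite3.connect("igem.db")` (the file name is only an example), `frequency(conn, word, team)` gives f and `frequency(conn, word)` gives F, so their ratio is the score of the word for that team.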

We have displayed these words with graphical effects using Wordle (http://www.wordle.net/create), a tool written in Java, with different parameters.

In the future, we will try to compare these words with the words used in the Instructors' papers (using the ENTREZ API), to determine how the Instructors influence the team. Our hypothesis for this last part is that the Instructors' influence has positive results on the team, namely better cooperation between members.

References

  • http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
  • http://petewarden.typepad.com/searchbrowser/2008/01/whats-the-secre.html