Team:Paris Liliane Bettencourt/Project/SIP
From 2010.igem.org
Edited by Ericmeltzer
Revision as of 20:42, 26 October 2010
Find iGEM winner with statistics
This year, iGEM team Paris has made a piece of software to analyze the word content of iGEM wikis. We initially wanted to try to use this software to predict who would win iGEM, but as it turns out it is useful for lots of other fun things as well.
To analyze the wikis, we have implemented an algorithm that uses Statistically Improbable Phrases (SIPs) to try to extract the meaning of a text, an approach which has recently been used by Amazon to find relationships between books. An SIP is a word or phrase that appears with a higher frequency in your target (whether it is a book or an iGEM wiki) than in a large dataset of background text. For example, a book about detectives might have "tweed jacket" or "corncob pipe" as SIPs because these phrases appear often in the book but are not common in the general corpus.
More formally:
- f is the frequency of the word in the target: f = o/n (o is the number of occurrences of the word, and n the number of words in the wiki).
- F is the frequency of the same word in a large sample: F = O/N (O is the number of occurrences in the sample, and N the number of words in the sample).
- f/F is the "improbability factor" of a given SIP candidate; a value above 1 means the word is more frequent in the target than in the sample.
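The definitions above can be sketched in a few lines of Python. This is only an illustration and not our actual code: the function name improbability_factors and the toy word lists are assumptions made for the example, and words that never occur in the background are skipped because F would be zero.

```python
from collections import Counter


def improbability_factors(target_words, background_words):
    """Return {word: f/F} for every word of the target (hypothetical helper)."""
    target_counts = Counter(target_words)          # o for each word
    background_counts = Counter(background_words)  # O for each word
    n = len(target_words)      # number of words in the target wiki
    N = len(background_words)  # number of words in the background sample
    factors = {}
    for word, o in target_counts.items():
        O = background_counts[word]
        if O == 0:
            continue  # F would be 0, so the ratio is undefined
        f = o / n
        F = O / N
        factors[word] = f / F
    return factors


# Toy data (invented for illustration): one wiki and the rest of the corpus.
wiki = "biofilm sensor biofilm clone".split()
others = "clone miniprep clone igem plasmid clone".split()
background = wiki + others  # the background includes the target itself

factors = improbability_factors(wiki, background)
# Print candidates from most to least improbable.
for word, factor in sorted(factors.items(), key=lambda kv: -kv[1]):
    print(word, round(factor, 3))
```

In this toy example "biofilm" gets a factor above 1, while "clone", which is common across the whole background, gets a factor below 1.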
Our algorithm uses all iGEM wikis as a background dataset, and then tries to find SIPs for each individual wiki. By using the combined text of all the wikis as a background, we are able to discard words that would be SIPs in a more general sense (clone, miniprep, igem) and instead focus on the words that make each project unique.
In this code, we use wget to download all the wikis of one year, then we remove all non-alphanumeric characters and convert the text to lower case. Next, we compute the frequency of each word in a sqlite3 database.
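The cleaning and counting steps could look roughly like the sketch below, which uses Python's built-in re and sqlite3 modules. It is a minimal illustration under stated assumptions, not our actual code: the table name counts, its columns, and the helpers tokenize and store_counts are invented for the example, non-alphanumeric characters are replaced by spaces so that words stay separated, only ASCII letters and digits are kept, and the download step with wget is left out.

```python
import re
import sqlite3


def tokenize(text):
    """Replace non-alphanumeric (ASCII) characters by spaces, lower-case, split."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).lower().split()


def store_counts(conn, wiki_name, text):
    """Add the word counts of one wiki to a hypothetical 'counts' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "wiki TEXT, word TEXT, n INTEGER, PRIMARY KEY (wiki, word))"
    )
    for word in tokenize(text):
        # Create the row on first sight of the word, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO counts VALUES (?, ?, 0)", (wiki_name, word)
        )
        conn.execute(
            "UPDATE counts SET n = n + 1 WHERE wiki = ? AND word = ?",
            (wiki_name, word),
        )
    conn.commit()


# In-memory database and invented sample text, for illustration only.
conn = sqlite3.connect(":memory:")
store_counts(conn, "Paris", "Miniprep the BioFilm; biofilm!")
store_counts(conn, "Other", "clone, miniprep")

# Occurrences of each word over all wikis (O in the notation above).
rows = conn.execute(
    "SELECT word, SUM(n) FROM counts GROUP BY word ORDER BY word"
).fetchall()
print(rows)
```

Keeping one row per (wiki, word) pair means both the per-wiki counts and the background totals can be read from the same table with a GROUP BY query.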
We have extracted and calculated these frequencies for the last 3 years. The results are available at the bottom of the page.
We have displayed these words graphically using [http://www.wordle.net/create this software written in Java] with different parameters.
In the future, we will try to compare these words with the words used in the Instructors' papers (using the Entrez API), to determine how Instructors influence their teams. We think that an Instructor's influence has positive effects on the team, such as better cooperation between members.
References
- http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
- http://petewarden.typepad.com/searchbrowser/2008/01/whats-the-secre.html