Team:Paris Liliane Bettencourt/Project/SIP

From 2010.igem.org

{{Template:Paris2010_2}}
<html>
<p style="display:block">
<a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/SIP">
<img src="https://static.igem.org/mediawiki/2010/4/4c/SIP.png" width="148" height="120" title="SIP">
</a>
<font size=4>SIP Wiki Analyser </font>
<a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/Synbioworld">
<img src="https://static.igem.org/mediawiki/2010/2/25/SBW.jpg" width="129" height="107" align=right title="SynBioWorld">
</a>
<a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/Population_counter">
<img src="https://static.igem.org/mediawiki/2010/9/93/Pop_counter_logo-01.jpg" width="129" height="107" align=right title="Population Counter">
</a>
<a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/Memo-cell">
<img src="https://static.igem.org/mediawiki/2010/a/aa/Memo_cell-01.jpg" width="129" height="107" align=right title="Memo-Cell">
</a> <br />
</p>
</html>
----
<html>
<p style="display:block;">
<div class="submenu">
<ul id="submenu_3">
<li><a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/SIP" target="_self">Overview</a></li>
  <li><a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/SIP/Results" target="_self">Results</a></li>
  <li><a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/SIP/Downloads" target="_self">Downloads</a></li>
  <li><a href="https://2010.igem.org/Team:Paris_Liliane_Bettencourt/Project/SIP/Codes" target="_self">Codes</a></li>
</ul>
</div>
<br /><br /><br />
</p>
</html>
==  A new software tool to analyse iGEM projects ==
<p style="display:block">
<br />
This year, iGEM team Paris has made a piece of software to analyze the word content of iGEM wikis. We initially wanted to use this software to predict who would win iGEM, but as it turns out it is useful for lots of other fun things as well.<br />
To analyse the wikis, we have implemented an algorithm that uses '''Statistically Improbable Phrases''' (SIPs) to try to extract the meaning of a text, an approach which has recently been used by Amazon to characterize books and find relationships between them. A SIP is a word that appears with a higher frequency in your target (whether it is a book or an iGEM wiki) than in a large dataset of background text. For example, a book about detectives might have "tweed jacket" or "corncob pipe" as SIPs, because these words appear often in the book but are not common words in the general corpus.
<br />
<br />
More formally:
* '''f''' is the frequency of the word in the target: '''f = o/n''', where '''o''' is the number of occurrences of the word and '''n''' is the total number of words in the wiki.
* '''F''' is the frequency of the same word in a large background sample: '''F = O/N''', where '''O''' is the number of occurrences in the sample and '''N''' is the total number of words in the sample.
* '''f/F''' is the "improbability factor" of a given SIP candidate: the higher the ratio, the more characteristic the word is of that wiki (see the sketch below).
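To make the ranking concrete, here is a minimal, self-contained sketch of the f/F computation for a single word. It is not part of the analyser itself, and the counts used in main() are invented for the example; in practice the four numbers come out of our sqlite3 database.
<pre>
/* Minimal sketch of the improbability factor f/F described above.
   The counts in main() are made up for illustration only. */
#include <stdio.h>

double sip_value(unsigned long o, unsigned long n,   /* occurrences / total words in one wiki  */
                 unsigned long O, unsigned long N)   /* occurrences / total words in all wikis */
{
    double f = (double)o / (double)n;     /* local frequency  f = o/n */
    double F = (double)O / (double)N;     /* global frequency F = O/N */
    return (F > 0.0) ? f / F : 0.0;       /* improbability factor f/F */
}

int main(void)
{
    /* hypothetical word appearing 42 times in a 10,000-word wiki
       and 90 times in a 2,000,000-word background corpus */
    printf("SIP value: %.2f\n", sip_value(42, 10000, 90, 2000000));
    return 0;
}
</pre>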
<br />
Our algorithm uses all iGEM wikis as a background dataset, and then tries to find SIPs for each individual wiki. By using all of the wikis' text put together as a background, we are able to discard words that would be SIPs in a more general sense (clone, miniprep, igem) and instead focus on the words that make each project unique.

== SIP.C ==

The full source of our analyser is given below; it is in the public domain, so feel free to reuse it.

<pre>
/* SIP.C *** Make SIP database and dictionary

   PUBLIC DOMAIN
   From iGEM team 2010 Paris

   This code lets you calculate the SIP words (most improbable words) of each
   team wiki.
   You need sqlite3, wget and html2text to run this program.

   build:
   $ gcc -o sip sip.c -lsqlite3

   usage:
   $ ./sip [list of team names] [name of database] [year]

   Sqlite3 database layout:

   One table per team:
   +--------+-----------+------------+-----------+
   | word   | local occ | local freq | SIP value |
   +--------+-----------+------------+-----------+
   | string | u_long    | float      | float     |
   +--------+-----------+------------+-----------+

   Table "dictionary":
   +--------+------------+-------------+
   | word   | global_occ | global_freq |
   +--------+------------+-------------+
   | string | u_long     | float       |
   +--------+------------+-------------+
*/
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* chdir()        */
#include <sys/stat.h>   /* mkdir()        */
#include <sys/types.h>
#include <sqlite3.h>

#define MAXBUFFER 256
#define LEN_CMD 96

#define TEAM_NAME 1
#define DATABASE_NAME 2
#define YEAR 3

#define FALSE 0
#define TRUE 1

void chgchar( char *string, char c, char r, int len )
{
    int i = 0;

    while ( i < len )
    {
        if ( string[i] == c )
        {
            string[i] = r;
            i++;
        }
        else
            i++;
    }
}

void close_sqlitedb( sqlite3 ***conn )
{
    if ( sqlite3_close(**conn) != SQLITE_OK )
    {
        printf("Error closing the db: %s\n", sqlite3_errmsg(**conn));
    }
}

void read_next_word( char word[MAXBUFFER], FILE *fp )
{
    char buffer[MAXBUFFER];
    int ch;
    int n = 0;

    ch = fgetc(fp);

    if ( ch == EOF ) {
        word[0] = '\0';
        return;
    }

    while ( ch != 0x20 && ch != EOF && n < MAXBUFFER - 1 )
    {
        buffer[n] = ch;
        n++;
        ch = fgetc(fp);
    }
    buffer[n] = '\0';
    strcpy( word, buffer );
}

int is_word_valid( char *word )
{
    // there's no filter at the moment: every word is accepted.
    // We could compare against a white list (e.g. MeSH) here.
    (void)word;
    return TRUE;
}

int is_word_exists( char *word, sqlite3_stmt **statement, sqlite3 **conn )
{
    if ( sqlite3_bind_text( *statement, 1, word, -1, SQLITE_STATIC) != SQLITE_OK ) {
        printf("Error binding to query select: %s\n", sqlite3_errmsg(*conn));
        close_sqlitedb( &conn );
        exit(1);
    }

    if ( sqlite3_step( *statement ) != SQLITE_DONE ) {
        sqlite3_reset( *statement );
        return TRUE;
    } else {
        sqlite3_reset( *statement );
        return FALSE;
    }
}

void add_to_the_list( char *word, sqlite3_stmt **statement, sqlite3 **conn )
{
    if ( sqlite3_bind_text( *statement, 1, word, -1, SQLITE_STATIC) != SQLITE_OK ) {
        printf("Error binding to query insert: %s\n", sqlite3_errmsg(*conn));
        close_sqlitedb( &conn );
        exit(1);
    }

    sqlite3_step( *statement );
    sqlite3_reset( *statement );
}

void inc_occ( char *word, sqlite3_stmt **statement, sqlite3 **conn )
{
    // UPDATE ... SET global_occ = global_occ + 1 WHERE word=?
    if ( sqlite3_bind_text( *statement, 1, word, -1, SQLITE_STATIC) != SQLITE_OK ) {
        printf("Error binding to query update: %s\n", sqlite3_errmsg(*conn));
        close_sqlitedb( &conn );
        exit(1);
    }

    sqlite3_step( *statement );
    sqlite3_reset( *statement );
}

void compute_freq( sqlite3 **conn, float nbr_of_words, char *table )
{
    /* For each word we compute: freq = occ_of_word / nbr_of_words.
       Note that nbr_of_words must be a REAL so the division is not truncated. */
    char sqlite3_query[MAXBUFFER];
    sqlite3_stmt *statement;

    sprintf( sqlite3_query, "UPDATE %s SET global_freq = global_occ / %f",
             table, nbr_of_words);

    if ( sqlite3_prepare_v2(*conn, sqlite3_query, -1, &statement, NULL ) != SQLITE_OK )
    {
        printf("Error compiling the request: %s\n", sqlite3_errmsg(*conn) );
        close_sqlitedb( &conn );
        exit(1);
    }
    sqlite3_step( statement );
    sqlite3_finalize( statement );
}

void make_dictionary( char *team_list, char *database_name )
{
    FILE *fp, *fp_wiki;
    float nbr_of_words = 0;
    char word[MAXBUFFER], digest_name[MAXBUFFER-LEN_CMD], team_name[MAXBUFFER-LEN_CMD];
    int size_name;

    sqlite3_stmt *stmt_select, *stmt_insert, *stmt_update;

    sqlite3 *conn;
    if ( sqlite3_open( database_name, &conn ) != SQLITE_OK )
    {
        printf("Error opening the db: %s\n", sqlite3_errmsg(conn));
        exit(1);
    }

    fp = fopen(team_list, "r");
    if ( fp == 0 ) {
        printf("Can't open %s file\n", team_list);
        exit(1);
    }

    // pre-compile the queries to speed up the process
    if ( sqlite3_prepare_v2(conn, "SELECT * FROM dictionary WHERE word=?", -1,
                            &stmt_select, NULL ) != SQLITE_OK )
    {
        printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
        goto CLOSE_DB;
    }

    if ( sqlite3_prepare_v2(conn, "INSERT INTO dictionary VALUES (?, 1, NULL)", -1,
                            &stmt_insert, NULL ) != SQLITE_OK )
    {
        printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
        goto CLOSE_DB;
    }

    if ( sqlite3_prepare_v2(conn, "UPDATE dictionary \
                            SET global_occ = global_occ + 1 WHERE word=?",
                            -1, &stmt_update, NULL ) != SQLITE_OK )
    {
        printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
        goto CLOSE_DB;
    }

    while( !feof(fp) )
    {
        fgets( team_name, (MAXBUFFER - LEN_CMD), fp );
        if ( feof(fp) ) {
            break;
        }

        size_name = strlen( team_name );
        team_name[size_name - 1] = '\0';          // strip the trailing newline
        chgchar( team_name, '-', '_', size_name );
        printf("On the %s Team\n", team_name);

        if ( chdir(team_name) == -1 ) {
            printf("error changing directory to %s\n", team_name);
            exit(1);
        }

        strncpy( digest_name, team_name, (MAXBUFFER-LEN_CMD-4) );
        strcat( digest_name, ".xtr");

        fp_wiki = fopen( digest_name, "r" );
        if ( fp_wiki == 0 ) {
            printf("error opening %s file", digest_name);
            exit(1);
        }

        if ( chdir("..") == -1 ) {
            printf("error changing directory to ..\n");
            exit(1);
        }

        sqlite3_exec(conn, "begin", NULL, NULL, NULL );
        while ( !feof(fp_wiki) )
        {
            read_next_word( word, fp_wiki );
            if ( is_word_valid( word ) == FALSE ) {
                continue;
            }
            if ( is_word_exists( word, &stmt_select, &conn ) == FALSE ) {
                add_to_the_list( word, &stmt_insert, &conn );
            }
            else {
                inc_occ( word, &stmt_update, &conn );
            }

            nbr_of_words++;
        }
        sqlite3_exec(conn, "commit", NULL, NULL, NULL );
        fclose(fp_wiki);
    }
    fclose(fp);
    sqlite3_finalize( stmt_select );
    sqlite3_finalize( stmt_insert );
    sqlite3_finalize( stmt_update );

    compute_freq( &conn, nbr_of_words, "dictionary" );

CLOSE_DB:
    if ( sqlite3_close(conn) != SQLITE_OK ) {
        printf("Error closing the db: %s\n", sqlite3_errmsg(conn));
    }
}

void make_database( char *team_list, char *database_name )
{
    sqlite3 *conn;
    sqlite3_stmt *statement;
    FILE *fp;
    int size_name;
    char team_name[MAXBUFFER-LEN_CMD], sqlite_query[MAXBUFFER];

    if ( sqlite3_open( database_name, &conn ) != SQLITE_OK )
    {
        printf("Error opening the db: %s\n", sqlite3_errmsg(conn));
        exit(1);
    }

    if ( sqlite3_prepare_v2(conn, "CREATE TABLE dictionary ( word text,  \
                            global_occ int, global_freq real )", -1, &statement, NULL ) != SQLITE_OK )
    {
        printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
        goto CLOSE_DB;
    }

    sqlite3_step( statement );
    sqlite3_finalize( statement );

    fp = fopen(team_list, "r");
    if ( fp == 0 ) {
        printf("Can't open %s file\n", team_list);
        exit(1);
    }

    // one table per team, same layout as the dictionary plus the sip column
    while( !feof(fp) )
    {
        fgets( team_name, (MAXBUFFER - LEN_CMD), fp );
        if ( feof(fp) ) {
            break;
        }

        size_name = strlen( team_name );
        team_name[size_name - 1] = '\0';
        chgchar( team_name, '-', '_', size_name );
        printf("Create %s Team Table\n", team_name);

        sprintf( sqlite_query, "CREATE TABLE %s ( word text, global_occ int, \
                 global_freq real, sip real )", team_name);
        if ( sqlite3_prepare_v2(conn, sqlite_query, -1, &statement, NULL ) != SQLITE_OK )
        {
            printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
            goto CLOSE_DB;
        }

        sqlite3_step( statement );
        sqlite3_reset( statement );
    }
    fclose( fp );
    sqlite3_finalize( statement );

CLOSE_DB:
    if ( sqlite3_close(conn) != SQLITE_OK ) {
        printf("Error closing the db: %s\n", sqlite3_errmsg(conn));
    }
}

void download_wiki( char *team_list, char *year )
{
    FILE *fp_src, *fp_tar, *fp;
    int ch, last_ch = 0x20;   /* start as a space so leading separators are squeezed */
    char buffer[MAXBUFFER], team_name[MAXBUFFER-LEN_CMD], digest_name[MAXBUFFER-LEN_CMD];
    int size_name;

    fp = fopen(team_list, "r");
    if ( fp == 0 ) {
        printf("Can't open %s file\n", team_list);
        exit(1);
    }

    while( !feof(fp) )
    {
        fgets( team_name, (MAXBUFFER - LEN_CMD), fp );
        if ( feof(fp) ) {
            break;
        }

        size_name = strlen( team_name );
        team_name[size_name - 1] = '\0';
        chgchar( team_name, '-', '_', size_name );

        if ( mkdir(team_name, 0777) == -1 ) {
            printf("error in the mkdir\n");
            exit(1);
        }

        if ( chdir(team_name) == -1 ) {
            printf("error changing the directory to %s\n", team_name);
            exit(1);
        }

        // the 2007 wikis live on parts.mit.edu, later years on <year>.igem.org
        if ( strcmp( year, "2007") == 0 ) {
            snprintf(buffer, (MAXBUFFER-LEN_CMD), "wget -R.jpg -R.png -R.gif \
                     -R.jpeg -RUser:* -E -l1 -nd -r \
                     http://parts.mit.edu/igem07/index.php/%s", team_name);
            printf("CMD : %s\n", buffer);
            system(buffer);
        } else {
            snprintf(buffer, (MAXBUFFER-LEN_CMD), "wget -E -ITeam:%s -nd -r \
                     http://%s.igem.org/Team:%s", team_name, year, team_name);
            printf("CMD : %s\n", buffer);
            system(buffer);
        }

        // flatten all downloaded pages into one plain-text file
        snprintf(buffer, (MAXBUFFER-LEN_CMD), "html2text * > %s", team_name);
        system(buffer);

        strncpy( digest_name, team_name, (MAXBUFFER-LEN_CMD-4) );
        strcat( digest_name, ".xtr");

        fp_src = fopen(team_name, "r");
        if ( fp_src == 0 ) {
            printf("Can't open %s file\n", team_name);
            exit(1);
        }

        fp_tar = fopen(digest_name, "w");
        if ( fp_tar == 0 ) {
            printf("Can't open %s file\n", digest_name);
            exit(1);
        }

        // keep only letters, digits and '-', lower-case everything,
        // and squeeze runs of other characters into a single space
        while( !feof(fp_src) )
        {
            ch = fgetc(fp_src);

            if ( (ch >= 0x41 && ch <= 0x5A) ||
                 (ch >= 0x61 && ch <= 0x7A) ||
                 (ch == 0x20 && last_ch != 0x20) ||
                 ch == 0x2D ||
                 (ch >= 0x30 && ch <= 0x39) ) {
                if (ch >= 0x41 && ch <= 0x5A) { // put in lower case
                    ch += 0x20;
                }
                fputc(ch, fp_tar);
            }
            else {
                ch = 0x20;
                if ( last_ch != 0x20 ) {
                    fputc(ch, fp_tar);
                }
            }
            last_ch = ch;
        }

        fclose( fp_src );
        fclose( fp_tar );

        if ( chdir("..") == -1 ) {
            printf("error changing the directory back to ..\n");
            exit(1);
        }
    }
    fclose(fp);
}

void compute_sip( sqlite3 **conn, char *team_name )
{
    /* For each word of the team table: get its local frequency, look up its
       global frequency in the dictionary, then store the ratio f/F in the
       sip column of the team table. */

    sqlite3_stmt *statement, *stmt_select, *stmt_update;
    char sqlite_query[MAXBUFFER];
    float local_freq, global_freq;
    int res;

    snprintf( sqlite_query, MAXBUFFER-LEN_CMD, "SELECT * from %s", team_name);
    if ( sqlite3_prepare_v2(*conn, sqlite_query, -1, &statement, NULL ) != SQLITE_OK ) {
        printf("compute_sip() : Error compiling the request 1: %s\n",
               sqlite3_errmsg(*conn) );
        close_sqlitedb( &conn );
        exit(1);
    }

    if ( sqlite3_prepare_v2(*conn, "SELECT * from dictionary WHERE word=?", -1,
                            &stmt_select, NULL ) != SQLITE_OK )
    {
        printf("compute_sip() : Error compiling the request 2: %s\n",
               sqlite3_errmsg(*conn) );
        close_sqlitedb( &conn );
        exit(1);
    }

    snprintf( sqlite_query, MAXBUFFER-LEN_CMD, "UPDATE %s SET sip = ?/? WHERE word=?",
              team_name);
    if ( sqlite3_prepare_v2(*conn, sqlite_query, -1, &stmt_update, NULL ) != SQLITE_OK ) {
        printf("compute_sip() : Error compiling the request 3: %s\n",
               sqlite3_errmsg(*conn) );
        close_sqlitedb( &conn );
        exit(1);
    }

    sqlite3_exec(*conn, "begin", NULL, NULL, NULL );
    while( (res = sqlite3_step(statement)) == SQLITE_ROW )
    {
        local_freq = (float)sqlite3_column_double(statement, 2);

        if ( sqlite3_bind_text( stmt_select, 1, (const char *)sqlite3_column_text(statement, 0), -1,
                                SQLITE_STATIC) != SQLITE_OK ) {
            printf("compute_sip() : Error binding 1: %s\n", sqlite3_errmsg(*conn));
            close_sqlitedb( &conn );
            exit(1);
        }
        sqlite3_step( stmt_select );
        global_freq = (float)sqlite3_column_double(stmt_select, 2);

        if ( sqlite3_bind_double( stmt_update, 1, (double)local_freq) != SQLITE_OK ) {
            printf("compute_sip() : Error binding 2: %s\n", sqlite3_errmsg(*conn));
            close_sqlitedb( &conn );
            exit(1);
        }

        if ( sqlite3_bind_double( stmt_update, 2, (double)global_freq) != SQLITE_OK ) {
            printf("compute_sip() : Error binding 3: %s\n", sqlite3_errmsg(*conn));
            close_sqlitedb( &conn );
            exit(1);
        }

        if ( sqlite3_bind_text( stmt_update, 3, (const char *)sqlite3_column_text(statement, 0), -1,
                                SQLITE_STATIC) != SQLITE_OK ) {
            printf("compute_sip() : Error binding 4: %s\n", sqlite3_errmsg(*conn));
            close_sqlitedb( &conn );
            exit(1);
        }

        sqlite3_step( stmt_update );

        sqlite3_reset( stmt_select );
        sqlite3_reset( stmt_update );
    }
    sqlite3_exec(*conn, "commit", NULL, NULL, NULL );

    sqlite3_finalize( statement );
    sqlite3_finalize( stmt_select );
    sqlite3_finalize( stmt_update );
}

void make_sipword( char *team_list, char *database_name )
{
    FILE *fp_wiki, *fp;
    float nbr_of_words = 0;
    char word[MAXBUFFER], digest_name[MAXBUFFER-LEN_CMD], team_name[MAXBUFFER-LEN_CMD];
    char sqlite_query[MAXBUFFER];
    int size_name;

    sqlite3_stmt *stmt_select, *stmt_insert, *stmt_update;

    sqlite3 *conn;
    if ( sqlite3_open( database_name, &conn ) != SQLITE_OK )
    {
        printf("Error opening the db: %s\n", sqlite3_errmsg(conn));
        exit(1);
    }

    fp = fopen(team_list, "r");
    if ( fp == 0 ) {
        printf("Can't open %s file\n", team_list);
        exit(1);
    }

    while( !feof(fp) )
    {
        fgets( team_name, (MAXBUFFER - LEN_CMD), fp );
        if ( feof(fp) ) {
            break;
        }

        size_name = strlen( team_name );
        team_name[size_name - 1] = '\0';
        chgchar( team_name, '-', '_', size_name );

        printf("On the %s Team\n", team_name);
        nbr_of_words = 0;   // count the words of this team only

        // pre-compile the queries to speed up the process
        snprintf( sqlite_query, MAXBUFFER-LEN_CMD, "SELECT * FROM %s WHERE word=?",
                  team_name);
        if ( sqlite3_prepare_v2(conn, sqlite_query, -1,
                                &stmt_select, NULL ) != SQLITE_OK )
        {
            printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
            goto CLOSE_DB;
        }

        snprintf( sqlite_query, MAXBUFFER-LEN_CMD, "INSERT INTO %s \
                  VALUES (?, 1, NULL, NULL)", team_name);
        if ( sqlite3_prepare_v2(conn, sqlite_query, -1,
                                &stmt_insert, NULL ) != SQLITE_OK )
        {
            printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
            goto CLOSE_DB;
        }

        snprintf( sqlite_query, MAXBUFFER-LEN_CMD, "UPDATE %s \
                  SET global_occ = global_occ + 1 WHERE word=?", team_name);
        if ( sqlite3_prepare_v2(conn, sqlite_query, -1,
                                &stmt_update, NULL ) != SQLITE_OK )
        {
            printf("Error compiling the request: %s\n", sqlite3_errmsg(conn) );
            goto CLOSE_DB;
        }

        if ( chdir(team_name) == -1 ) {
            printf("error changing directory to %s\n", team_name);
            exit(1);
        }

        strncpy( digest_name, team_name, MAXBUFFER-LEN_CMD-4 );
        strcat( digest_name, ".xtr");

        fp_wiki = fopen( digest_name, "r" );
        if ( fp_wiki == 0 ) {
            printf("error opening %s file", digest_name);
            exit(1);
        }

        if ( chdir("..") == -1 ) {
            printf("error changing directory to ..\n");
            exit(1);
        }

        sqlite3_exec(conn, "begin", NULL, NULL, NULL );
        while ( !feof(fp_wiki) )
        {
            read_next_word( word, fp_wiki );
            if ( is_word_valid( word ) == FALSE ) {
                continue;
            }
            if ( is_word_exists( word, &stmt_select, &conn ) == FALSE ) {
                add_to_the_list( word, &stmt_insert, &conn );
            }
            else {
                inc_occ( word, &stmt_update, &conn );
            }

            nbr_of_words++;
        }
        sqlite3_exec(conn, "commit", NULL, NULL, NULL );
        fclose(fp_wiki);
        sqlite3_finalize( stmt_select );
        sqlite3_finalize( stmt_insert );
        sqlite3_finalize( stmt_update );

        compute_freq( &conn, nbr_of_words, team_name );
        compute_sip( &conn, team_name );
    }

CLOSE_DB:
    if ( sqlite3_close(conn) != SQLITE_OK ) {
        printf("Error closing the db: %s\n", sqlite3_errmsg(conn));
    }
}

int main( int argc, char *argv[] )
{
    /* we need the team list, the database name and the year */
    if ( argc < 4 ) {
        printf("usage : %s [list of team names] [name of database] [year]\n", argv[0] );
        exit(1);
    }

    printf("start to download the wikis!\n");
    download_wiki( argv[TEAM_NAME], argv[YEAR] );

    printf("make the database!\n");
    make_database( argv[TEAM_NAME], argv[DATABASE_NAME] );

    printf("start to make the dictionary!\n");
    make_dictionary( argv[TEAM_NAME], argv[DATABASE_NAME] );

    printf("start to calculate the SIP words!\n");
    make_sipword( argv[TEAM_NAME], argv[DATABASE_NAME] );

    return 0;
}
</pre>
 
Our software's workflow is essentially as follows: we use wget to retrieve all of the wiki pages from the iGEM server, strip all non-alphanumeric characters, convert the remaining text to lower case, and finally compute the frequency of each word, and from it the SIP value, in an sqlite3 database.
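Concretely, for a given team the program shells out to commands equivalent to the following; the team name here is just a placeholder, and the wget options are the ones used in sip.c:
<pre>
$ wget -E -ITeam:Example_Team -nd -r http://2010.igem.org/Team:Example_Team
$ html2text * > Example_Team
</pre>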
<font color="green">'''We have extracted and calculated this for the last 3 years of iGEM; the data is available under the Downloads tab for you to play with if you like.'''</font>
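If you download one of the databases, a quick way to inspect a team's top SIP words is the sqlite3 command-line shell; the database file name and team table below are only examples, the per-team tables follow the schema shown in sip.c:
<pre>
$ sqlite3 sip_words_2009.db
sqlite> SELECT word, sip FROM Example_Team ORDER BY sip DESC LIMIT 20;
</pre>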
<br /><br />
We have used the app [http://www.wordle.net/create Wordle] to visualize our SIP results.  Some of the graphics we made can be viewed under the Results tab!
<br />
<br />
<br />
<br />
In the future, we intend to use the ENTREZ API to get SIP data on each instructor's research output, and then compare the SIPs of instructors to those of their team, to get an idea of how an advisor's research path influences their team's work.  We are interested to see how teams whose projects diverge wildly from their "mother labs" do in iGEM.
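As a rough sketch of that lookup (the author name and the PubMed id are placeholders; esearch and efetch are part of NCBI's public Entrez E-utilities interface, and the exact pipeline still has to be written):
<pre>
$ wget -q -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Doe+J%5BAuthor%5D&retmax=100"
$ wget -q -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345678&rettype=abstract&retmode=text"
</pre>
The plain-text abstracts returned by the second command could then be run through the same word-counting code we use for the wikis.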
<br />
One other idea we had was to try to predict the future winner of the iGEM competition using our software. The idea is to find SIPs for all of the previous years' winners, giving each a coefficient depending on the prize the project got. Once this is done, every wiki can be compared against this database of "winning" SIPs, as sketched below. Will this comparison give clues about the future winner? We think it would be interesting to know... but it still has to be implemented.
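One possible implementation, sketched below against our existing per-team tables, would keep the weighted winning SIPs in a separate table (here called winning_sips, which does not exist yet) and score each team with a simple join:
<pre>
-- hypothetical sketch: winning_sips(word, weight) would be built from past winners,
-- with weight depending on the prize each project got
SELECT SUM(w.weight * t.sip) AS winner_score
FROM Example_Team AS t
JOIN winning_sips AS w ON w.word = t.word;
</pre>
A team with a high score would share more of its characteristic vocabulary with past winners; whether that actually predicts anything is exactly the open question.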
<br /><br />

== Downloads ==

'''Team List'''
* [http://www.lsdlive.org/tmp-igem/download/teamlist_2009 Team list 2009]
* [http://www.lsdlive.org/tmp-igem/download/teamlist_2008 Team list 2008]
* [http://www.lsdlive.org/tmp-igem/download/teamlist_2007 Team list 2007]
'''Wiki Data'''
* [http://www.lsdlive.org/tmp-igem/download/wdata_2009.tar.gz Wiki data 2009 (TARGZ)] | [http://www.lsdlive.org/tmp-igem/download/wdata_2009.tar.zip Wiki data 2009 (ZIP)]
* [http://www.lsdlive.org/tmp-igem/download/wdata_2008.tar.gz Wiki data 2008 (TARGZ)] | [http://www.lsdlive.org/tmp-igem/download/wdata_2008.tar.zip Wiki data 2008 (ZIP)]
* [http://www.lsdlive.org/tmp-igem/download/wdata_2007.tar.gz Wiki data 2007 (TARGZ)] | [http://www.lsdlive.org/tmp-igem/download/wdata_2007.tar.zip Wiki data 2007 (ZIP)]
'''SIP Database'''
* sip words database 2009
* sip words database 2008
* sip words database 2007
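The wiki data archives can be unpacked with tar and fed into your own scripts; for example, with the 2009 file listed above:
<pre>
$ tar xzf wdata_2009.tar.gz
</pre>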
We very much hope that other teams will make use of the data we've generated to run some of their own analyses!
<br />
<br />
<br />
<br />
</p>
<html>
</div>
</div>
</html>

== References ==

* http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
* http://petewarden.typepad.com/searchbrowser/2008/01/whats-the-secre.html