Team:Paris Liliane Bettencourt/Project/SIP/Downloads
From 2010.igem.org
Revision as of 08:19, 27 October 2010
Team List
- Team list 2007 (UNIX) | Team list 2007 (WIN32)
- Team list 2008 (UNIX) | Team list 2008 (WIN32)
- Team list 2009 (UNIX) | Team list 2009 (WIN32)
- [http://www.lsdlive.org/misc/wdata_2007.zip Wiki data 2007 (ZIP)]
- [http://www.lsdlive.org/misc/wdata_2008.zip Wiki data 2008 (ZIP)]
- [http://www.lsdlive.org/misc/wdata_2009.zip Wiki data 2009 (ZIP)]
- SIP words database 2007 (SQLITE3)
- [http://www.lsdlive.org/misc/wsip_2008.db.zip SIP words database 2008 (SQLITE3)]
- [http://www.lsdlive.org/misc/wsip_2009.db.zip SIP words database 2009 (SQLITE3)]
To read the databases, use [http://www.sqlite.org/ sqlite3].
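As a starting point for exploring a downloaded database, the following minimal sketch uses Python's standard sqlite3 module to list the tables in a file. The file name wsip_2009.db is an assumption about what the ZIP contains, and the helper name list_tables is hypothetical; apart from the dictionary table mentioned on this page, the schema is not documented here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return sorted(name for (name,) in rows)


if __name__ == "__main__":
    # Assumed file name: adjust to whatever the unzipped archive contains.
    connection = sqlite3.connect("wsip_2009.db")
    try:
        for table in list_tables(connection):
            print(table)
    finally:
        connection.close()
```

Once the table names are known, ordinary SELECT queries can be run on the same connection.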
Warning: These files were generated using "links -dump" to strip the HTML, which speeds up processing. This step is optional, because SIP removes the HTML itself later. With links, some pages with special characters such as '(', ')' and ':' in their names were not converted. We consider this minor because it affects only a small number of pages, but you can regenerate the database without the HTML parsing step.
You can also use html2text, but if it encounters a special character, it does not remove the HTML.
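If neither links nor html2text is convenient, HTML can also be stripped with Python's standard library. This is a rough sketch of an alternative, not the method used to build these files; the names TextExtractor and strip_html are hypothetical. Because it joins text fragments with spaces, text split by inline tags may gain extra spaces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Special characters such as '(', ')' and ':' are ordinary text to this parser, so they pass through unchanged.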
Also, some teams were missing from the databases (28 for 2009 and 10 for 2008). To build the database, I replaced the '-' character with '_', because sqlite3 does not work with this character, but I forgot to restore the original name when downloading the team pages, so the URL was wrong and the files were not downloaded. As a result, no team with a '-' in its name was downloaded. I cannot regenerate the whole database because it takes a long time to compute, so I have only recomputed the teams with a '-' in their name (the dictionary table is unchanged).
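The bug described above comes from using one name for two purposes. A minimal sketch of how to keep them apart: derive the database-safe name from the original team name, and always build the download URL from the original. The function names, the team name "Example-Team", and the URL pattern are assumptions for illustration, not taken from the SIP source.

```python
def table_name(team):
    """Return a database-safe identifier: '-' is replaced by '_'."""
    return team.replace("-", "_")


def wiki_url(team, year):
    """Build the download URL from the ORIGINAL team name (assumed pattern)."""
    return f"http://{year}.igem.org/Team:{team}"


if __name__ == "__main__":
    team = "Example-Team"  # hypothetical team name containing '-'
    print(table_name(team))      # name used inside the database
    print(wiki_url(team, 2009))  # name used for downloading
```

Keeping the original name as the single source of truth means the sanitized form never leaks into the URL.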
Notes about filters: These files are unfiltered, but you can apply whatever filters you want: remove common words, keep only [http://www.nlm.nih.gov/mesh/ MeSH] terms, etc. Choose what you need!
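The two filters mentioned above can be sketched as follows, assuming the words have already been loaded into a dictionary mapping each word to its count. The function name filter_words and the sample word lists are hypothetical; a real MeSH filter would load its terms from the MeSH data files.

```python
def filter_words(counts, stop_words=frozenset(), keep_only=None):
    """Drop stop words and, if keep_only is given, keep only those terms.

    Matching is case-insensitive; stop_words and keep_only must be lowercase.
    """
    result = {}
    for word, count in counts.items():
        key = word.lower()
        if key in stop_words:
            continue
        if keep_only is not None and key not in keep_only:
            continue
        result[word] = count
    return result


if __name__ == "__main__":
    counts = {"the": 120, "plasmid": 14, "team": 40, "Bacteria": 9}
    # Remove common words only.
    print(filter_words(counts, stop_words={"the", "team"}))
    # Keep only terms from an allowed vocabulary (for example MeSH terms).
    print(filter_words(counts, keep_only={"plasmid", "bacteria"}))
```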