Team:VT-ENSIMAG/Registry
From 2010.igem.org
Revision as of 16:01, 6 October 2010
Screening of the iGEM registry
This summer we also screened the entire iGEM registry, for two reasons:
- to test our program on a data set that simulates the DNA orders received by a gene synthesis company;
- to contribute to iGEM by verifying that the sequences in the registry are not dangerous.

1) Real-world gene order simulation

In a real-life setting, such as a gene synthesis company processing thousands of sequences a day, the aim is not only to detect dangerous sequences correctly but also to avoid raising too many false hits. Each hit means that a person has to examine manually why it was raised, which is expensive.
To simulate the orders that a gene synthesis company could receive, we screened the iGEM registry, a collection of sequences that iGEM teams add to each year. It contained about 10,000 sequences, and we screened them all.
After screening the first 1724 sequences of the registry, the hit rate was 6.5%, which is far too high for an industrial environment.

We looked at where those hits were coming from and, based on that, made a few minor changes to our program:
- First, there were a few issues with the keyword list, which we corrected right away.
- Some BLAST results that should, in principle, have been among the Best Matches were left out, because keeping only the results with a query coverage of 100% was too restrictive. We therefore now include in the Best Matches all BLAST results with a query coverage of 95% or greater.
- Sometimes the BLAST results contained Select Agent matches with 100% query coverage, and these led to hits. However, the percent identity of these matches was so low (often around 45% for protein sequences) that the similarity could not be taken seriously. We therefore added a condition for a BLAST result to be a Best Match: its percent identity must be at least 60%.

After making these changes and screening the entire iGEM registry, the hit rate went down to 2.9%. We also reran some of the modified sequences we had tested previously and got results similar to before, so the detection capabilities of our program were not affected.

2) Contribution to iGEM

Fortunately, it is not the case that 2.9% of the sequences in the registry are dangerous. After looking at the results manually, we found that many of these hits were false hits. It is mainly the keyword list, and the way we use it, that needs to be improved to reduce the number of hits.
The remaining hits are often due to a single sub-sequence in one amino-acid frame. Such sub-sequences are most likely present by chance. We suspect that in many cases the amino-acid frame in which the hit occurs is not the one the end user intends to use. Whether a hit should be raised in that case is still an open question for us.

After setting aside the sequences where only one sub-sequence of one frame led to a hit, and the sequences where keyword issues led to a hit, we ended up with one true hit: the sequence BBa_I10020.
Fortunately, iGEM was already aware that this sequence is dangerous: its main page carries a prominent WARNING stating that the sequence is potentially dangerous.
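
As an illustration of the Best Match conditions described in section 1 above (query coverage of 95% or greater, percent identity of at least 60%), here is a minimal Python sketch. It is not the program's actual code: the `BlastResult` class, its field names, and the `raises_hit` rule (a hit whenever any Best Match is a Select Agent) are assumptions made for the example. Only the two thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_QUERY_COVERAGE = 95.0    # percent
MIN_PERCENT_IDENTITY = 60.0  # percent


@dataclass
class BlastResult:
    """One BLAST match. The field names are hypothetical."""
    subject: str
    query_coverage: float    # percent, 0-100
    percent_identity: float  # percent, 0-100
    is_select_agent: bool = False


def is_best_match(result: BlastResult) -> bool:
    """Apply both Best Match conditions: coverage and identity."""
    return (result.query_coverage >= MIN_QUERY_COVERAGE
            and result.percent_identity >= MIN_PERCENT_IDENTITY)


def best_matches(results):
    """Keep only the BLAST results that qualify as Best Matches."""
    return [r for r in results if is_best_match(r)]


def raises_hit(results) -> bool:
    # Assumption: a sequence is flagged when any Best Match is a Select Agent.
    return any(r.is_select_agent for r in best_matches(results))


# Example: a Select Agent match with full coverage but only 45% identity
# no longer counts as a Best Match, so it does not raise a hit.
results = [
    BlastResult("toxin-like protein", 100.0, 45.0, is_select_agent=True),
    BlastResult("harmless enzyme", 96.0, 88.0),
]
print([r.subject for r in best_matches(results)])  # ['harmless enzyme']
print(raises_hit(results))                         # False
```

With the earlier rule (100% query coverage only, no identity threshold), the first result would have been a Best Match and would have raised a hit, while the second would have been dropped; this is the kind of case the two changes address.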