Team:VT-ENSIMAG/Registry
From 2010.igem.org
Latest revision as of 04:18, 19 October 2010
Screening of the iGEM registry
This summer we also screened the entire iGEM registry. This screen had a two-fold purpose:
-test our program on a registry that would simulate the DNA orders received by a gene synthesis company
-contribute to iGEM by verifying that the sequences present in the registry aren't dangerous
In a real-life situation, such as a gene synthesis company processing thousands of sequences a day, the aim is not only to correctly detect dangerous sequences, but also to avoid raising too many false hits. Each hit means that a human has to look manually at why the hit was raised, which costs a lot of money.
To simulate the orders that a gene synthesis company could get, we decided to screen the iGEM registry, a table of sequences that iGEM teams add to each year. It contained about 10,000 sequences; we screened them all.
After screening the first 1724 sequences of this registry, the hit rate was 6.5%. By altering certain program parameters, we reduced it to 2.9%. This result highlights the program's customizable nature and potential for optimization.
We looked into the causes of all those hits and, as mentioned above, made a few minor changes to our program:
-first of all, there were a few issues with the keyword list, which we corrected right away
-often, some BLAST results that should, in theory, have been in the Best Matches were missing, because keeping only the results with a query coverage of 100% was too restrictive. Therefore, we decided to include in the Best Matches all BLAST results with a query coverage of 95% or greater
-sometimes the BLAST results contained matches of 100% query coverage that were Select Agents, and that led to hits. But the percent identity of these matches was so low (often around 45% for protein sequences) that the similarity could not be considered serious. Therefore we added a condition for a BLAST result to be a Best Match: the percent identity needs to be at least 60%.
After making these changes and screening the entire iGEM registry, the hit rate went down to 2.9%.
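The two Best Match conditions described above (query coverage of 95% or greater, percent identity of at least 60%) can be sketched as a simple filter. This is only an illustration, not the GenoTHREAT source: the `BlastResult` record, its field names, and the `raises_hit` rule (a hit whenever any Best Match is a Select Agent) are assumptions made for the example, and the keyword-list step is left out entirely.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_QUERY_COVERAGE = 95.0    # percent
MIN_PERCENT_IDENTITY = 60.0  # percent


@dataclass
class BlastResult:
    """Hypothetical record for one BLAST result; field names are assumptions."""
    subject: str
    query_coverage: float    # percent of the query covered by the alignment
    percent_identity: float  # percent of identical positions in the alignment
    is_select_agent: bool    # assumed to be decided elsewhere (e.g. by keywords)


def best_matches(results):
    """Keep only the results that satisfy both Best Match conditions."""
    return [
        r for r in results
        if r.query_coverage >= MIN_QUERY_COVERAGE
        and r.percent_identity >= MIN_PERCENT_IDENTITY
    ]


def raises_hit(results):
    """Assumed rule: raise a hit if any Best Match is a Select Agent."""
    return any(r.is_select_agent for r in best_matches(results))
```

With these thresholds, a Select Agent match at 100% query coverage but only 45% identity no longer raises a hit, while a match at 96% coverage and 80% identity is now included in the Best Matches.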
After making these changes to the program, we also reran some of the previously modified sequences and got results similar to before, so the detection capabilities of our program were not affected.
Hopefully, it is not the case that 2.9% of the sequences in the registry are dangerous. After manually looking at the results, we found that many of these hits were false hits. It is mainly the keyword list, and the way we use it, that needs to be improved to decrease the number of hits. The remaining hits are often due to a single sub-sequence in a single amino-acid frame. These sub-sequences just happen to be there by random chance. We suspect that in many cases the amino-acid frame where the hit happens isn't the one the final user intends to use, and we are still unsure whether a hit should be raised in that case.
After setting aside the sequences where only one sub-sequence of one frame led to a hit and the sequences where keyword issues led to a hit, we ended up with one true hit: the sequence BBa_I10020.
Luckily, iGEM was already aware that this sequence was dangerous: its main page carries a big WARNING indicating that it is potentially dangerous. This result shows that GenoTHREAT is capable of identifying potentially dangerous sequences in the registry. It could therefore be used as a tool to screen incoming sequences, to ensure the safety of the students using the registry as well as those around them.