Team:Hong Kong-CUHK/Project principle
Bioencryption by recombination--Principle
Site-specific recombination systems are classified into two distinct groups: integration-excision systems and inversion systems. Our shufflon system uses the latter. In this shufflon system, Rci-mediated recombination occurs between any two repeat sequences, causing inversion of the DNA segments independently or in groups.
Rci-dependent deletion of a shufflon segment flanked by the natural repeat sequences did not occur; i.e., the DNA sequences between the repeats are conserved after recombination, and no loss of DNA sequence was found.
The repeat sequences fall into four main groups: repeat a, b, c and d. Seven different repeat sequences are found in nature, numbered repeat 1 to 7. Repeats 1 and 2 belong to repeat a; repeats 4, 6 and 7 belong to repeat b; repeat 5 belongs to repeat c; and repeat 3 belongs to repeat d.
Experiments show that the inversion frequency is highest when the DNA sequence is flanked by two repeat a sequences, and that it is much higher than with any other pairing of repeats a, b, c and d flanking the DNA sequence.
There is a 12 bp sequence before every 19 bp repeat sequence. This sequence further enhances the inversion frequency, although the mechanism is not yet known. In our project, we only need to exploit the shufflon system for recombination. Therefore, we placed the specific 12 bp sequence before every 19 bp repeat sequence that we added.
For the Rci recombinase, it has been shown that the inversion frequency obtained with the wild type (WT) is greater than that obtained with modified versions or with point mutations at some positions of the rci gene. Therefore, we decided to use the wild-type Rci recombinase in our project.
Rci recombination system:
In our project, we constructed an Rci recombination system with regulated expression.
First, a promoter from the lac operon was placed at the beginning of the system. It allows users to control expression of the Rci recombinase with IPTG. The ribosome binding site (RBS) was placed at the second position, followed by the rci gene at the third position. The RBS allows the ribosome to bind before translation, so that the ribosome can translate the rci gene located immediately after the RBS. Finally, a bidirectional double terminator was placed at the fourth position. It terminates transcription in both directions, so that the rci gene is not transcribed in the reverse direction into an mRNA that would lead to formation of the wrong protein.
Translation is just the first step
Our system takes a progressive approach when transforming information into DNA.
The client first needs to construct a translation table; the extended ASCII table with 256 characters is used as the standard here. DNA can naturally be regarded as a quaternary (base-4) numeral system. With the DNA base adenine representing the number “0”, thymine representing “1”, cytosine representing “2” and guanine representing “3”, we are essentially encoding the 256 characters in this base-4 numeral system. Since 4^4 = 256, four bases are enough to represent one character.
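The base-4 mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Latin-1 is used as one concrete 256-character table (there are several "extended ASCII" tables), the most significant digit is written first, and the function names are hypothetical rather than taken from the project.

```python
BASES = "ATCG"  # digit 0 -> A, 1 -> T, 2 -> C, 3 -> G, as described in the text


def char_to_dna(code: int) -> str:
    """Write one 8-bit character code as four base-4 digits (assumed MSB first)."""
    if not 0 <= code <= 255:
        raise ValueError("character code must fit in 8 bits")
    # Each pair of bits is one base-4 digit; shifts 6, 4, 2, 0 walk from high to low.
    digits = [(code >> shift) & 0b11 for shift in (6, 4, 2, 0)]
    return "".join(BASES[d] for d in digits)


def text_to_dna(text: str) -> str:
    """Translate text into bases, four bases per character."""
    # Assumption: Latin-1 stands in for the 256-character translation table.
    return "".join(char_to_dna(b) for b in text.encode("latin-1"))


def dna_to_text(dna: str) -> str:
    """Inverse of text_to_dna."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        value = 0
        for base in dna[i:i + 4]:
            value = value * 4 + BASES.index(base)
        out.append(value)
    return out.decode("latin-1")


print(text_to_dna("Hi"))  # TACATCCT
```

For example, "H" has code 72, which is 1020 in base 4 and therefore becomes TACA under this digit order.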
Compression is the key
Before the DNA sequences are submitted for synthesis, a compression step follows the translation process.
We use Deflate, a well-known lossless data compression algorithm that combines Huffman coding with the LZ77 algorithm. This compression step is beneficial in two respects. First, more information can be included than in an uncompressed message of the same length. Second, homopolymer and repetitive regions can be reduced significantly. This is crucial to the DNA storage system, because homopolymer and repetitive regions in DNA sequences are detrimental to both DNA synthesis and sequencing; with compression, such cases are kept to a minimum.
An infrastructure for a true, massively parallel storage system
Incorporating a short message is not our goal. Instead, we are pursuing a true massively parallel storage system in which one can systematically incorporate useful information regardless of its size.
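As a rough illustration of the compression step, the sketch below uses Python's standard `zlib` module, which produces a Deflate stream inside a small zlib wrapper. It assumes that the message bytes are compressed first and the compressed bytes are then mapped to bases with the A/T/C/G = 0/1/2/3 scheme; how the project actually wires the two steps together is not detailed here, and the helper names are hypothetical.

```python
import zlib

BASES = "ATCG"  # 0 -> A, 1 -> T, 2 -> C, 3 -> G


def bytes_to_dna(data: bytes) -> str:
    """Map every byte to four bases (assumed most significant digit first)."""
    return "".join(BASES[(b >> s) & 3] for b in data for s in (6, 4, 2, 0))


def dna_to_bytes(dna: str) -> bytes:
    """Inverse of bytes_to_dna."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        value = 0
        for base in dna[i:i + 4]:
            value = value * 4 + BASES.index(base)
        out.append(value)
    return bytes(out)


def compress_to_dna(message: bytes) -> str:
    """Deflate the message (zlib container, level 9), then write it as bases."""
    return bytes_to_dna(zlib.compress(message, 9))


def decompress_from_dna(dna: str) -> bytes:
    """Read the bases back into bytes and inflate them."""
    return zlib.decompress(dna_to_bytes(dna))


message = b"ATTACK AT DAWN " * 40  # deliberately repetitive input
dna = compress_to_dna(message)
print(len(bytes_to_dna(message)), "bases uncompressed,", len(dna), "bases compressed")
```

Very short messages can come out longer after compression because of the fixed stream overhead, so the gain shows up mainly on longer or repetitive inputs.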
To store a large piece of information such as a photograph or a dictionary, it is impossible to include it in a single piece of DNA, because the length is limited by current DNA synthesis technology. One approach is to split the information into fragments and insert them into the cells. However, simply fragmenting the information and inserting it into the cell would destroy the data, as the order of the fragments would be unknown. To overcome this obstacle, we devised a new information system. Each sequence that we insert into the bacterial cell is composed of three sectors: header, message and checksum. The header is the address of that particular message fragment; it consists of 8 DNA bases, divided into four 2-base units named zone, region, area and district. The message is the message fragment itself, and the checksum is an identification and correction system for minor mutations.
Decryption is not simple. It passes through three tiers of security, namely the encoding system, the encryption system and the checksum system, and the message can only be retrieved when enough information is provided. The design of a single data fragment is shown here:
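The 8-base header can be illustrated with a small sketch. Each 2-base unit holds a value from 0 to 15, so four units give 16^4 = 65,536 distinct addresses. The ordering of the units (zone as the most significant part of the address) and all function names below are assumptions made for the example, not details confirmed by the project.

```python
BASES = "ATCG"  # 0 -> A, 1 -> T, 2 -> C, 3 -> G
FIELDS = ("zone", "region", "area", "district")


def encode_header(zone: int, region: int, area: int, district: int) -> str:
    """Write four values in the range 0..15 as four 2-base units (8 bases)."""
    header = ""
    for value in (zone, region, area, district):
        if not 0 <= value <= 15:
            raise ValueError("each header unit must be in 0..15")
        header += BASES[value >> 2] + BASES[value & 3]
    return header


def decode_header(header: str) -> dict:
    """Read an 8-base header back into its four named units."""
    if len(header) != 8:
        raise ValueError("header must be exactly 8 bases")
    values = [
        BASES.index(header[i]) * 4 + BASES.index(header[i + 1])
        for i in range(0, 8, 2)
    ]
    return dict(zip(FIELDS, values))


def index_to_header(index: int) -> str:
    """Address the n-th fragment, assuming zone is the most significant unit."""
    if not 0 <= index < 16 ** 4:
        raise ValueError("index does not fit in an 8-base header")
    return encode_header(
        (index >> 12) & 15, (index >> 8) & 15, (index >> 4) & 15, index & 15
    )


print(index_to_header(303))         # AAATACGG
print(decode_header("AAATACGG"))    # {'zone': 0, 'region': 1, 'area': 2, 'district': 15}
```

Sorting recovered fragments by their decoded header would then restore their order, provided the headers themselves survive intact.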
The full message can be restored from data fragments through a series of steps:
Step 1 : Next-generation high-throughput sequencing (NGS) and assembly
Given the information-encrypted bacteria, the plasmid DNA is extracted and subjected to next-generation high-throughput sequencing (NGS). We choose high-throughput sequencing over ordinary sequencing technology because NGS is a massively parallel sequencing process: multiple sequencing products (reads) cover any particular message stored within the DNA, and these multiple reads enable us to perform majority voting on bases whose quality is poor. Moreover, with the read assembly algorithms currently available, Velvet and EULER for example, assembling the reads from NGS is no longer a formidable task.
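Majority voting over multiple reads can be sketched as a column-wise vote. The example assumes the reads have already been aligned to the same coordinates and padded to equal length, with "-" marking positions a read does not cover; real pipelines would also weigh base quality scores, which this simplified, hypothetical helper ignores.

```python
from collections import Counter


def majority_consensus(aligned_reads: list) -> str:
    """Return the most frequent base in every column of pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned and padded to equal length")
    consensus = []
    for column in zip(*aligned_reads):
        # Ignore gap characters; they carry no vote.
        counts = Counter(base for base in column if base != "-")
        # "N" marks a column with no coverage at all.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


reads = [
    "TACATCCT",
    "TACATCCT",
    "TAGATCCT",  # one sequencing error at the third base
    "TACA-CCT",  # one uncovered position
]
print(majority_consensus(reads))  # TACATCCT
```

Ties are broken by whichever base was seen first in the column, so in practice a minimum coverage or a quality-weighted vote would be preferable.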
Step 2 : Identification of repeat sequences, messages and checksum
In the second tier, the encryption system is given (the R64 shufflon system in this case), so the repeats are known. The repeats can be located using alignment tools such as BLAST, and the sequences between the repeats are regarded as the message fragments, although their order is unknown. The checksum is located right after the last repeat sequence.
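The idea of cutting the assembled sequence at known repeats can be shown with a simplified sketch. It swaps BLAST for exact string matching, so it tolerates no mismatches, and it uses a made-up 19 bp placeholder in place of the real shufflon repeats; the function name is hypothetical as well.

```python
import re

# Placeholder 19 bp repeat for illustration only, NOT a real R64 shufflon repeat.
PLACEHOLDER_REPEAT = "CCATGGTTAACCGGATCCA"


def split_on_repeats(sequence: str, repeats: list) -> tuple:
    """Return (fragments between consecutive repeat hits, tail after the last hit)."""
    pattern = re.compile("|".join(re.escape(r) for r in repeats))
    hits = [(m.start(), m.end()) for m in pattern.finditer(sequence)]
    if len(hits) < 2:
        raise ValueError("need at least two repeat hits to delimit a fragment")
    fragments = [
        sequence[hits[i][1]:hits[i + 1][0]] for i in range(len(hits) - 1)
    ]
    # Per the text, the checksum sits right after the last repeat.
    checksum = sequence[hits[-1][1]:]
    return fragments, checksum


R = PLACEHOLDER_REPEAT
assembled = "TTTT" + R + "TACATCCT" + R + "TCGATAAT" + R + "GGCA"
fragments, checksum = split_on_repeats(assembled, [R])
print(fragments)  # ['TACATCCT', 'TCGATAAT']
print(checksum)   # GGCA
```

An alignment tool remains the better choice on real data, since sequencing errors inside a repeat would make an exact match fail.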
Final Step : Combinatorial problem
In the third tier, only the client knows the function used to derive the checksum. With the checksum formula, we are just one step away from our goal of recovering the correct message. The message fragments are concatenated in different permutations; each trial is fed into the checksum formula and the result is compared with the checksum on the sequence: BINGO if they are the same, and if not, one has to try again.
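The trial-and-compare search can be sketched as a brute-force loop over permutations. Because the real checksum function is known only to the client, CRC-32 from Python's `zlib` module is used here purely as a stand-in, and the function names are hypothetical.

```python
import zlib
from itertools import permutations


def checksum(dna: str) -> int:
    """Stand-in checksum (CRC-32 of the base string); the real formula is secret."""
    return zlib.crc32(dna.encode("ascii"))


def recover_message(fragments: list, expected: int):
    """Try every ordering of the fragments until the checksum matches."""
    for order in permutations(fragments):
        candidate = "".join(order)
        if checksum(candidate) == expected:
            return candidate
    return None  # no ordering reproduced the stored checksum


original = "GGCATACATCCT"
stored = checksum(original)                 # what would be read off the sequence
shuffled = ["TACA", "TCCT", "GGCA"]         # fragments recovered in unknown order
print(recover_message(shuffled, stored))    # GGCATACATCCT
```

The number of orderings grows as n! for n fragments, so this exhaustive search is only practical for a small number of fragments per checksum.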