Revision as of 17:51, 26 October 2010 by Glh

The Promoter and 5' UTR

The Eukaryotic Promoter

The promoter is a stretch of DNA, anywhere from a hundred to two thousand bases long, that precedes the segment of DNA to be transcribed. It is one of the largest centers of regulatory activity: many different cofactors are typically involved in determining whether a transcript is made for a gene. For a long time the promoter was presumed to be the only point in the central dogma at which regulation occurs, but it has since been established that every step of the DNA-to-RNA-to-protein-to-destination pathway exerts some control over the fate of a gene product.

In C. elegans, the promoter has not been studied very extensively. Because trans-splicing is so efficient (it replaces the original 5' end of the transcript with a spliced leader), it has been impossible to determine the actual transcription start site for most genes from pre-mRNA using traditional molecular genetics techniques. It is generally assumed that most steps closely reflect those of other, better-studied eukaryotes such as yeast, humans, and fruit flies. Even WormBook falls back on material from these sources in its description of the pre-initiation process, which is a must-read for those interested in synthetic worm biology and cannot be usefully summarized here.

Regulation of transcription in C. elegans is a messy subject, as it is in many eukaryotes. Regulatory elements may be located in introns, several kilobases upstream, or, as in the case of egl-1, downstream, on the other side of an unrelated gene. lin-39 required 30 kb of surrounding DNA to reproduce its expression pattern faithfully. These are exceptional cases, however. The majority of C. elegans protein-coding genes occur in tightly packed groups, where there is no room for such unusual structures, and their promoters, including remote elements, fit within 2000 bases; some are shorter than a hundred.

Engineering a synthetic promoter or promoter system for C. elegans could be a substantial and worthwhile project, but the amount of work involved at present may be prohibitive. See “Transcriptional regulation” in WormBook as a starting point for what is known about promoter elements in the worm.

Operons

An operon is a genetic structure in which one promoter is followed by multiple coding sequences. These coding sequences are transcribed together as a single transcript. In prokaryotes that transcript is translated as it is; in C. elegans it is typically separated into individual mRNAs by trans-splicing. Operons are used and studied extensively in prokaryotes, with the lac operon of E. coli serving as a standard introduction to gene structure in second-year genetics courses.

While at WormGuide we generally use “gene” to mean a structure that includes the promoter, the 5' UTR, the CDS, and the 3' UTR, the nomenclature breaks down when this structure is violated. Geneticists have traditionally meant specifically the protein-coding (or functional RNA) sequence, sometimes together with the surrounding sequence that is specific to it. The definition has been a persistent source of confusion over the years, and the issue has only gotten worse with the blossoming of genomics.

Until the early nineties, it was believed that eukaryotes had fully rejected operons in favor of more elaborate control mechanisms, but in 1993 Spieth et al. showed that this was not the case. There are approximately 1000 operons in C. elegans, containing more than a tenth of the worm’s genes. In general, these operons contain genes whose products must be expressed in controlled ratios and interact with one another, especially essential genes such as those encoding mitochondrial, transcriptional, splicing, and translational machinery.

Operons have proven immensely useful in the construction of BioDevices in E. coli, and will likely prove just as valuable in C. elegans. They must also be taken into consideration when searching for promoters: the region immediately upstream of a CDS may not contain any regulatory or polymerase-binding sites at all, especially for high-activity genes. For information on constructing operons, see the section on trans-splicing.
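As a minimal sketch of the promoter-search caveat, assume an operon annotation is available as a mapping from operon name to its genes, listed in order of transcription. The helper below reports which gene's upstream region is the sensible place to look for a promoter. The function name `promoter_source`, the operon ID, and the gene names are hypothetical placeholders, not real annotations.

```python
from typing import Dict, List

# Hypothetical annotation: operon name -> genes in 5'-to-3' transcription order.
EXAMPLE_OPERONS: Dict[str, List[str]] = {
    "OPERON_A": ["gene-1", "gene-2", "gene-3"],
}


def promoter_source(gene: str, operons: Dict[str, List[str]]) -> str:
    """Return the gene whose upstream region should be searched for a promoter.

    A gene inside an operon shares the promoter of the operon's first gene;
    a gene outside any annotated operon is assumed to have its own promoter.
    """
    for members in operons.values():
        if gene in members:
            return members[0]
    return gene


if __name__ == "__main__":
    for name in ("gene-1", "gene-3", "gene-9"):
        print(name, "-> search upstream of", promoter_source(name, EXAMPLE_OPERONS))
```

For a downstream operon gene such as the placeholder `gene-3`, the sequence directly in front of its CDS is intercistronic, so the lookup redirects the search to the first gene of the operon.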

The Kozak Consensus Sequence

In prokaryotes such as E. coli, the Shine-Dalgarno (SD) sequence, located just before the translational start, acts as a ribosomal binding site (RBS). In the absence of the SD sequence, expression is greatly reduced or eliminated.

There is no direct analog of this in eukaryotes. The ribosome latches onto a structure of cofactor proteins that have bound to certain features at the start of the transcript, and then scans down the mRNA until it finds a start codon. A ‘naked’ start codon is sufficient to start translation, but it is a very weak signal, and the ribosome is likely to skip over it and not express the gene.

To improve the efficacy of translational initiation, a larger sequence is used. The more precisely this matches certain well-known patterns, the more likely it is that the ribosome will stop scanning at the correct point and begin translation.

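The canonical pattern for a Kozak sequence is as follows:

    gccRccAUGG

Where R is A or G (more commonly A), AUG is the start codon, and the final G is the first nucleotide of the codon immediately after the start codon. Upper-case letters mark the highly conserved positions; lower-case letters give the most common base at positions that vary more. Wikipedia has some examples of weaker consensus sequences.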
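To make the scanning model concrete, here is a minimal sketch that walks an mRNA from 5' to 3' and grades the context of each AUG it meets. It checks only the two positions usually treated as most influential: a purine three bases before the A of AUG, and a G immediately after the AUG. The three-level grading (strong, adequate, weak) is a common rule of thumb from work on vertebrate mRNAs; that it carries over unchanged to C. elegans is an assumption, and the function names are illustrative only.

```python
from typing import Iterator, Tuple


def kozak_strength(mrna: str, aug_index: int) -> str:
    """Grade the context of the AUG at aug_index as 'strong', 'adequate' or 'weak'.

    Two positions are checked (assumed rule of thumb, not a worm-specific model):
      -3: three bases before the A of AUG should be a purine (A or G)
      +4: the base right after AUG should be G
    """
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA spelling
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    minus3 = seq[aug_index - 3] if aug_index >= 3 else ""
    plus4 = seq[aug_index + 3] if aug_index + 3 < len(seq) else ""
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[hits]


def scan_for_starts(mrna: str) -> Iterator[Tuple[int, str]]:
    """Yield (index, strength) for every AUG in 5'-to-3' order."""
    seq = mrna.upper().replace("T", "U")
    i = seq.find("AUG")
    while i != -1:
        yield i, kozak_strength(seq, i)
        i = seq.find("AUG", i + 1)


if __name__ == "__main__":
    # A weak upstream AUG followed by one in a full consensus context.
    for index, strength in scan_for_starts("UUUUUUAUGCUUGCCACCAUGGCU"):
        print(index, strength)
```

In the scanning picture described above, the first AUG in this example sits in a weak context and is likely to be skipped, while the second matches the consensus at both checked positions.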

The Promoter in C. elegans Research

In most biological research on C. elegans, the focus is on physiological and developmental processes, not on the central dogma or transcriptional regulatory mechanisms. As a consequence, although transcriptional start sites are sometimes known, biologists assembling constructs typically elide the difference between the upstream portion of the transcript and the promoter itself. In fact, for many genes even the start of the promoter is not known, and a large portion of the intergenic region is used instead. While messy, this saves time and is an effective safeguard against accidentally losing promoter elements. In combination with a gene’s original 3' UTR, the native expression pattern can usually be reproduced closely.
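The "take the intergenic region" habit can be sketched in a few lines. The example assumes a plus-strand gene, 0-based coordinates, and a chromosome held as a plain string; the function name `upstream_region` and the 2000-base default cap (borrowed from the promoter lengths quoted earlier on this page) are illustrative choices, not an established protocol.

```python
def upstream_region(chrom: str, start_codon: int,
                    prev_gene_end: int = 0, max_len: int = 2000) -> str:
    """Return the sequence immediately upstream of a plus-strand start codon.

    chrom         -- chromosome sequence (assumed plus strand, 0-based indexing)
    start_codon   -- index of the A of the ATG
    prev_gene_end -- index just past the upstream neighbouring gene
    max_len       -- cap on the returned length (assumed default of 2000 bases)

    The region stops at the neighbouring gene or at max_len, whichever
    comes first, and does not include the start codon itself.
    """
    if not 0 <= prev_gene_end <= start_codon <= len(chrom):
        raise ValueError("coordinates out of order or out of range")
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    left = max(prev_gene_end, start_codon - max_len)
    return chrom[left:start_codon]


if __name__ == "__main__":
    # Toy chromosome: neighbouring gene (10 bases), 5-base gap, then a CDS.
    toy = "A" * 10 + "CCCCC" + "ATGAAATAA"
    print(upstream_region(toy, start_codon=15, prev_gene_end=10))
```

A minus-strand gene would need the mirror-image slice followed by reverse complementation, which is left out here to keep the sketch short.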

Note: to be compatible with our BioBricks, these promoter sequences must be modified. Specifically, we remove the start codon from the Promoterome record and place it, together with an artificial Kozak sequence, at the start of the protein in order to conceal ligation scars. As a result, following our standard exactly will typically produce a higher-than-normal level of expression. See Contributing for more information.
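The relocation described in this note might look roughly like the sketch below. It assumes the promoter record ends with the ATG start codon and that the coding sequence begins with one; the leader `GCCACC` is only a placeholder for "an artificial Kozak sequence" and is not taken from our standard. Both function names are hypothetical. See Contributing for the actual requirements.

```python
START = "ATG"

# Placeholder leader only; the real sequence is defined by the standard, not here.
PLACEHOLDER_KOZAK_LEADER = "GCCACC"


def strip_start_codon(promoter_record: str) -> str:
    """Return the promoter record without its trailing start codon.

    Assumes the record's last three bases are the gene's ATG.
    """
    rec = promoter_record.upper()
    if not rec.endswith(START):
        raise ValueError("record does not end with a start codon")
    return rec[:-len(START)]


def add_kozak(cds: str, kozak_leader: str = PLACEHOLDER_KOZAK_LEADER) -> str:
    """Return the coding sequence with a Kozak leader placed before its ATG."""
    seq = cds.upper()
    if not seq.startswith(START):
        raise ValueError("coding sequence does not begin with a start codon")
    return kozak_leader.upper() + seq


if __name__ == "__main__":
    promoter_part = strip_start_codon("ttgacagctaATG")
    protein_part = add_kozak("ATGGCTAAATAA")
    print(promoter_part)
    print(protein_part)
```

Keeping the two parts separate means the start codon travels with the protein part, so whatever scar the ligation leaves falls upstream of the Kozak leader and not inside the coding sequence.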
