Team:METU Turkey Software

From 2010.igem.org

Revision as of 23:46, 26 October 2010 by Fsyauqy (Talk | contribs)

Team

METU Turkey Software is an interdisciplinary team of 8 students and 3 advisors from backgrounds including Molecular Biology, Bioinformatics, Computer Engineering, and Computer Education and Instructional Technology. For iGEM 2010, we have combined our knowledge and experience in these fields to bring a much-needed solution to an everyday problem in the field of synthetic biology.

Motivation

Since 2008, we have been participating in iGEM as the METU (Middle East Technical University) wet-lab team, and each year we have noticed the increasing number of participating teams, along with an increase in biobrick entries at partsregistry.org. While having more biobricks to choose from is incredible, searching for and choosing the appropriate parts is becoming a challenge. This year, during the construction of iGEM biobrick parts for our new project, we felt the need for an application that finds interacting parts based on an input/output model in order to design genetic constructs. Using specialized software to search the parts registry for possible biobricks to include in our construct would be much easier, faster and more accurate than searching manually. We shared our need with a group of friends who are software engineers and initiated the METU_Turkey_SOFTWARE team, in which we worked together over this summer to build the BIO-Guide software.

Scope and Future Aspects

The partsregistry.org is a continuously growing collection of standard genetic parts that can be mixed and matched to build synthetic biology devices and systems. The Registry is based on the principle of "get some, give some". Registry users benefit from using the parts and information available in the Registry for designing their own genetically engineered biological systems. In exchange, the expectation is that Registry users will contribute back to the information and the data on existing parts and will submit new parts they have designed in order to improve this community resource.

As an expanding database, partsregistry.org needs to be better organized, and its standardization template needs to be improved. Additionally, because each part can be used in multiple ways in different construct combinations, an application is needed to search through the database. BIO-Guide is the first software that organizes over 1000 parts in partsregistry.org as possible atomic parts for building new biological devices and systems for specific inputs and outputs, based on graph theory. The need for similar applications and software tools is now unavoidable in the emerging field of synthetic biology. The innovative approach that makes partsregistry.org easy to use for synthetic biology applications is the collection of standardized parts that can be used in any combination with minimal effort under one database. However, while working on our algorithm to search for possible combinations of parts for a given input and output, we realized that the present standards are inadequate and that the parts registry form must be improved.

In the very near future, a new format for the parts registry form is needed, and a few additional features should be implemented to allow more control over the database. We are planning to suggest a new format and features for the parts registry based on the survey results we have received, and to build the next version of BIO-Guide based on the revised parts registry form. Along with using the new parts registry standards, we will improve the algorithm so that the software can search through more complex relations and return all possible functional constructs.

Project Introduction

As the field of synthetic biology is on the rise, iGEM is growing very fast, and the number of parts in the parts registry is increasing with the addition of more complex parts each day. After we faced some difficulty while running our algorithms on the parts registry, the need for more effective standardization of parts entry became apparent. We investigated the information on parts in iGEM’s 2010 distribution and reorganized the information on the parts registry forms according to the needs of our algorithm. We then used graph-theoretic modeling to visualize the relations between iGEM parts and to standardize the representation of the parts as much as possible. This helped us to find input-output relations between the parts. Furthermore, our program BIO-Guide is now able to provide alternative pathways for constructing the most reliable and functional Biobrick devices with respect to given inputs and expected outputs, serving as a guide to the Biobricks parts registry.

Methods

Part Extraction Standards

All information about the parts that is essential in the experimental setup of iGEM projects has been utilized. The information for the parts provided in all three 384-well plates of the Spring 2010 distribution has been standardized. Our standardization criteria are discussed in detail under Database Standardization. An ER diagram has been generated which describes the organization of the data. Around 70% of the parts information was fetched by custom parsing code from the XML and Excel files provided by iGEM. The rest of the data had to be collected and organized manually, as its organization was not regular enough to be handled by an algorithm. This was one of the most time-consuming steps in our project. For each construct and Biobrick, the information collected was: Activity, Inducer, Activator, Repressor and Inhibitor for promoters; and Inducer, Activator, Repressor and Inhibitor for synthesized molecules (mostly proteins, RNA fragments, etc.).

Combination

Rules (Image Combinations)

To build our input/output relation graphs, we first ran our algorithm on the real combination dataset, which contains all of the few thousand possible combinations of the biobricks. However, after performing all combinations for the first few hundred biobricks, the application slowed down tremendously, and displaying the biobrick graphs also became very time consuming. To overcome this bottleneck we developed a new strategy in which we used only the construct combinations of the biobricks distributed within the plates. Moreover, from the information gathered on the subparts of the distributed constructs, we also collected the subpart assembly order, such as 1st: promoter, 2nd: rbs, 3rd: coding sequence, then any internal parts, and last: terminator. Each specific Biobrick type has been assigned a number from 1 to 19 as a unique image ID. Gathering the information on subparts was not a straightforward process. The ImageID assembly order of each construct has been used to extract the type information for each subpart within that construct. This innovative approach helped us to reveal 400 possible brick combinations present within the 3x384 well plates distributed by iGEM in Spring 2010.
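The subpart typing step described above can be sketched as follows. This is only a minimal illustration, not the BIO-Guide code: the table `IMAGE_ID_TYPES`, the function `type_subparts` and the part labels are hypothetical, and the numeric IDs are placeholders because the actual mapping of the 19 image IDs is not listed here.

```python
# Hypothetical image-ID table. The real BIO-Guide mapping (IDs 1 to 19)
# is not given in the text, so these four entries are placeholders.
IMAGE_ID_TYPES = {
    1: "promoter",
    2: "rbs",
    3: "coding",
    4: "terminator",
}


def type_subparts(subpart_ids, image_ids):
    """Pair each subpart of a construct with the type named by its image ID.

    Both lists are expected in assembly order, so the n-th image ID
    describes the n-th subpart.
    """
    if len(subpart_ids) != len(image_ids):
        raise ValueError("each subpart needs exactly one image ID")
    return [(part, IMAGE_ID_TYPES[img]) for part, img in zip(subpart_ids, image_ids)]


# Placeholder subpart labels, not real registry IDs.
example = type_subparts(["partA", "partB", "partC", "partD"], [1, 2, 3, 4])
```

Keeping the image IDs in assembly order means the type of every subpart can be read off by position, without inspecting the subpart's own registry page.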

Database Standardization

The two main focuses of our project were the organization of the available information about Biobricks on iGEM’s website and the development of a software application to help synthetic biologists at the experimental set-up level by providing all available construct combinations for any given input and output relations, which they can utilize for their own projects.

Normalization and re-organization of the part information on iGEM’s web site was needed in order to develop our application, which automatically searches the possible construct combinations. For the organization and analysis of the Biobricks, we used the part information for the Spring 2010 distribution. The information on all three 384-well plates distributed by iGEM was scrutinized and checked individually to specify the standards available and needed. iGEM provides a large number of parts in a hierarchical way, but there is no order in the information flow and there are no common standards. Furthermore, the bulk of the information is being used in an ineffective manner. Some of the parts distributed are known to be nonfunctional. The web pages for parts contain a lot of information, but the majority of it is again not ordered. Moreover, some additional information had to be removed or replaced so that the information for parts could be used effectively. We recommend removing the redundant bulk information related to parts on iGEM’s web site in the future.

Although the final standardization we have suggested is not intended for general public use and was urgently needed to satisfy the needs of our algorithm, it will still be a valuable resource, since it summarizes the basic information about the parts.

As the first step in building the proposed standardization template, the selected headings related to parts are listed in Table 1. Submission of part IDs for individual parts is an accepted and quite valuable way of tracking information. Although every part has a unique partID, there is also a need to assign every part a unique part name as its official iGEM name. Part names will have an important role, as they will provide a short description of the part which synthetic biologists can immediately recognize and utilize during the construction of unique Biobricks. Additionally, unique part names will help to identify devices with more than one Biobrick in their constructs. Assigning unique and distinct names that describe the nature and content of parts will help researchers to recognize and search for the parts.


Headings Selected From Previous Entry Forms for Indication of Standardized Information

=========================================

PartID:

PartName:

Bricks:

BrickIDs:

ImageIDs:

RFC10:

RFC21:

RFC23:

RFC25:

=========================================

Table 1: The table above describes and designates the qualities of parts that identify their composition and shows the status of previously assigned standards. PartID refers to the unique ID number for parts, including atomic parts and assemblies. PartName refers to the unique names given to parts. Bricks refers to the shortcut names which specify atomic parts. ImageIDs refers to individual numbers, or combinations of numbers, assigned by us. The RFC fields refer to the status of parts with respect to the RFC standards.
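As an illustration, the headings of Table 1 could be held in a small record type such as the one below. This is a sketch under stated assumptions rather than the actual BIO-Guide schema: the class name `PartRecord`, the field types, the boolean encoding of RFC status and the rule that an atomic part carries exactly one image ID are all our assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartRecord:
    """One entry of the standardized template (field types are assumed)."""

    part_id: str                                            # PartID
    part_name: str                                          # PartName
    bricks: List[str] = field(default_factory=list)         # Bricks (shortcut names)
    brick_ids: List[str] = field(default_factory=list)      # BrickIDs
    image_ids: List[int] = field(default_factory=list)      # ImageIDs, in assembly order
    rfc_status: Dict[str, bool] = field(default_factory=dict)  # e.g. {"RFC10": True}

    def is_atomic(self) -> bool:
        # Assumption: an atomic part is described by a single image ID.
        return len(self.image_ids) == 1


# Placeholder values, not a real registry entry.
record = PartRecord(
    part_id="BBa_EXAMPLE",
    part_name="example construct",
    image_ids=[1, 2, 3, 4],
    rfc_status={"RFC10": True, "RFC21": False},
)
```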

iGEM provides both individual atomic parts and pre-combined constructs such as devices and systems. The availability of combined constructs is important to researchers, as combining individual bio-bricks one at a time would be very time consuming. These previously merged constructs serve as a repository of puzzle pieces and can be used for different purposes. To date, the largest and most trustworthy source for synthetic biology and its components is iGEM’s parts registry. In 2010, iGEM provided over 1000 parts that have initiated many projects. Having more atomic parts available in iGEM’s repository will lead to the design of more complex and robust constructs, and will give us a better chance to design different constructs for unique purposes. Also, for the parts that are already available, extra steps need to be taken for quality control and surveillance. Quality control of the information for the parts is essential for the future of iGEM and synthetic biology. Even though we found the pre-determined RFC standards useful and included them in our standardized template, the information for some individual parts still requires re-organization, as RFC standards alone do not describe the functionality of parts well enough to satisfy the needs of wet-lab biologists.

Without question, there is an urgent need to build a distinct, specific and well-organized database with its own standards for synthetic biology; however, developing such a database is not an easy task.


Contact Information of Part Owners and Qualitative Group Comments about Parts

=========================================

Designers: Mail:

GroupFavorite:

StarRating:

Parameters:

=========================================

Table 2: The table above shows information about the owners of parts, their contact information, and the popularity of the parts among groups. The Parameters heading refers to distinctive experimental details unique to the usage of parts, which should be decided by groups.

The second step in building the standardized template was to gather the phylogenic information about the part development process, which includes the name of the group, the designer and contact information, along with the comments from the group on the parts they have submitted. Contact information is especially important for iGEM, as other groups who need extra information about an available part can use it to reach the required information. Even though iGEM highly encourages contacting the designers of the available parts, the unavailability of contact information points to the fact that iGEM’s parts registry needs strong re-organization in order to serve the synthetic biology community properly.

Additionally, the “group favorite” and “starRating” fields are important for the individual evaluation of parts, but they do not get the attention they deserve from the iGEM groups. “Group Favorite” defines the confidence of the designer group in the part. “StarRating” rates the part in terms of popularity and usage efficiency among the groups. According to our observations, most groups are either not aware of these fields or use them incorrectly or ineffectively. For example, for a part with a full reporter which is known to be functional and gives precise and expected results, the StarRating should be at least 2 stars, but in the 2010 distribution it is very difficult to find a part whose “StarRating” is above one. These two evaluations are important for quickly determining the functionality of parts, so they have been included in the proposed standardization template. However, as they have not been used properly up to now, we had to include all parts in our queries regardless of their “Group Favorite” and “StarRating” evaluations when re-organizing the parts information during the development of our software application.


Input and Output Characteristics of Parts

=========================================

Parameters:

-Input:

• Promoter:

• Activity:

• Inducer:

• Activator:

• Repressor:

• Inhibitor:

• Promoter2:

• Activity:

• Inducer:

• Activator:

• Repressor:

• Inhibitor:

-Output:

• Reporter:

• Reporter2:

• Regulator:

• Inducer:

• Activator:

• Repressor:

• Inhibitor:

• Regulator2:

• Inducer:

• Activator:

• Repressor:

• Inhibitor:

-Working Condition:

=========================================

Table 3: The table above describes the input relations based on promoters and the output products based on the functional genes and RNAs included within the parts. Working condition describes any influencing factor or circumstance which is directly related to the functional properties of parts.

The third part of our standardization template includes the parameters of contingent input and output elements. These parameters are classified into two groups for simplicity, as presented in Table 3. This final part of the standardization template includes the most important information about the Biobricks, which is required for the BioGuide software to run its search algorithm.

Briefly, the BioGuide application is designed to capture the input and output relations of individual parts in order to examine possible Biobrick pathways for specific input and output queries. In other words, at the pre-experimental stage, it helps wet-lab biologists to design their unique constructs by revealing possible alternative options for pre-determined purposes, along with the primary paths. Our ultimate goal is to improve the algorithm designed for iGEM 2010 and present a new version of BioGuide at iGEM 2011, which will provide optimum design of constructs for predetermined parameters.
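A pathway query of this kind can be sketched as a search over directed (source, target) pairs. The example below is a hedged illustration and not the BioGuide algorithm itself: the function `find_pathways`, the `max_len` cut-off and the single-letter device names are hypothetical, and we assume the input-output relations have already been reduced to such pairs.

```python
def find_pathways(edges, start_nodes, end_nodes, max_len=4):
    """Enumerate simple directed paths from any start node to any end node.

    edges       -- iterable of (source, target) pairs
    start_nodes -- devices that respond to the queried input
    end_nodes   -- devices that produce the queried output
    max_len     -- assumed cut-off on the number of devices in a path
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    ends = set(end_nodes)
    paths = []

    def extend(path):
        node = path[-1]
        if node in ends:
            paths.append(list(path))
        if len(path) >= max_len:
            return
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # never revisit a device within one path
                path.append(nxt)
                extend(path)
                path.pop()

    for start in start_nodes:
        extend([start])
    return paths


# Hypothetical devices A, B, C: A can drive B and C, B can drive C, C can drive A.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A")]
paths = find_pathways(edges, ["A"], ["C"])
```

Here the query returns both the direct path and the alternative path through B, which mirrors the idea of presenting alternative options alongside the primary path.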

Most parts are functional or nonfunctional constructs formed from atomic parts, and every part should carry the information for all of its atomic parts. The “input” heading actually stands for promoters. Parts with one or more promoters can be found in iGEM’s Parts Registry. Along with the information on which and how many promoters a part has, the activity level of the promoters is also important, in order to distinguish between a constitutively active promoter and a promoter activated by specific physiological processes or states. It was crucial for us to dissect this information in order to run our algorithm, as it directly affects which inputs can activate the devices or systems.

Throughout our investigations of the Parts Registry, we found that much of the terminology was being used ambiguously. Although this might not be vital for synthetic biologists, it makes understanding the function of certain regulatory elements laborious and time consuming for the researcher. Thus, we recommend that the definitions of certain regulatory elements be revised and fixed specifically for synthetic biology, for easy communication, sharing and searching of information.

Common misuses of the terminology can guide us in constructing a standard nomenclature for synthetic biology. We claim that a standard nomenclature is urgently needed for synthetic biology for the following reasons. First, synthetic biology is an emerging research discipline and a highly promising industrial application area. Second, some terms are prone to being used in place of one another, which causes problems in global communication about synthetic biology, so the terminology needs to be redefined to build a standard nomenclature. Lastly, the nomenclature is of major importance for the construction of a persistent and trustworthy database for synthetic biology that serves the global exhibition and exchange of information. For instance, there are obvious misunderstandings about the words predominantly used for regulation processes. We have noticed that the terms “inhibitor” and “repressor” are being used interchangeably in the part information pages. One example is the lactose inhibitor protein, a widely used DNA-binding transcriptional repressor, which has been labeled both as “inhibitor” and as “repressor” in iGEM’s Parts Registry. Similar problems resulting from ambiguous use of terminology were also observed with regulatory elements. To sum up, we investigated all input elements for promoters and classified them in terms of their function, their effect and the input element they require. We therefore suggest that the terminology used for the regulation of transcription should be defined clearly on iGEM’s website and that correct use of the terminology should be enforced.

The second group of parameters was collected under the title “Output”, which refers to the products of functional genes. Inconsistently, the term “reporter” is also listed under the same heading. Reporters are also genes, whose products can be used for screening as an output. In our group’s view, using the term “reporter” for genes is unnecessary, adds complexity to the distribution of information and gives rise to discrepancies. Instead of the term “reporter”, the predefined “gene” description should be used for genes which can function as reporters. The special information related to the characteristics of that gene should also be presented on the part info web page.

Furthermore, the same term, “reporter”, was used for both atomic parts and composite bio-bricks, and the overall image descriptions for these were also defined as “reporters”. We want to point out that using the same nomenclature for both atomic genes and whole functional constructs adds to the complexity and makes specific searches through the Parts Registry difficult. Assigning “reporter” to both atomic parts and whole constructs is therefore not good practice. Instead, we suggest using other available terms into which most of the constructs now known as reporters can be grouped, such as “protein generators”, “composite parts” or “inverters”.

Devices are whole constructs which are functional and have specific and distinct functions. Unfortunately, as we have observed, the term “device” is also being used for parts which are not functional and do not have a specific function at all. Moreover, within the classification of devices, we argue that some terms are also being used unnecessarily and ambiguously. Devices are classified into five types: protein generators, reporters, inverters, receivers and senders, and measurement devices. For example, iGEM defines protein generators as:

Protein generator = promoter + rbs + gene + terminator

Though we accept the definition of protein generators, we observed that there are numerous parts which are defined as protein generators but most of which do not actually fit the definition provided above. Although some parts are not functional and do not generate proteins at all, they are classified as protein generators, which makes searching for parts in the registry difficult. Furthermore, there are also numerous parts which are defined as “composite parts” but actually fit the same definition as protein generators. To overcome the problem of misused device types, we extracted the related image ID information for the composite parts. Image ID information helped us to categorize composite parts correctly according to their individual atomic parts and to identify the ones with more than one function, such as being both inhibitor and activator. In other words, we used image and part IDs in order to link an input to its outputs.
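The categorization by atomic part types can be illustrated with a short check against the protein generator definition. This is a minimal sketch, assuming the subpart types of a construct are already known in assembly order (for example from image IDs); the function names are hypothetical, and allowing several consecutive rbs + gene units between the promoter and the terminator is our assumption, not part of the iGEM definition quoted above.

```python
def is_protein_generator(types):
    """Check promoter + (rbs + gene)+ + terminator.

    Accepting more than one rbs + gene unit is an assumption made for
    this example so that operon-like constructs also match.
    """
    if len(types) < 4 or types[0] != "promoter" or types[-1] != "terminator":
        return False
    middle = types[1:-1]
    if len(middle) % 2 != 0:
        return False
    return all(
        middle[i] == "rbs" and middle[i + 1] == "gene"
        for i in range(0, len(middle), 2)
    )


def classify(types):
    """Return a device label based only on the ordered atomic part types."""
    if len(types) <= 1:
        return "atomic part"
    if is_protein_generator(types):
        return "protein generator"
    return "composite part"


full = classify(["promoter", "rbs", "gene", "terminator"])
no_promoter = classify(["rbs", "gene", "rbs", "gene", "terminator"])
```

With this rule, a construct that lacks a promoter is labeled a composite part even if it contains coding sequences, because it cannot produce protein on its own.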

The subtitle “Working Condition” includes all the detailed information about the experimental properties of parts, and the details about the working process of individual parts and complete devices. We marked “Working Condition” in our standardization template as potentially the most important title, since it helps synthetic biologists to better understand the functions of parts in iGEM’s parts registry database. The main problem we encountered with “Working Condition” is that, for most parts, the details about the working process are insufficient and not provided consistently.


Examples of Misuse of Terminology:

For Composite Parts:

PartID: BBa_S04055

PartName: Synthetic lacYZ operon

This part is functional and responsible for the production of LacY and LacZ proteins. This part partially fits the definition of “composite part”, but should actually be a protein generator, as it fully fits the definition of “protein generators”.

For Protein Generators:

PartID: BBa_J45299

PartName: PchA & PchB enzyme generator

The part illustrated above actually fits the definition of “composite part”, but in the parts registry it is classified as a protein generator. This part can be functional, but it needs a promoter. Even though this part is not functional on its own and is not capable of producing protein, the parts registry labels it as a protein generator. We suggest that all parts in the registry which are composed of more than one atomic part and which are not functional on their own, but can be made functional, should be classified as “composite parts”.

For Reporters:

PartID: BBa_J04451

PartName: RFP Coding Device with an LVA tag

This functional part is classified as “Reporter” in the parts registry database. It clearly fits the same description as Protein Generator in the Biobrick parts registry standards. Although this part has a specific and known functional role, characterizing it as a reporter is unnecessary and adds to the complexity of the information provided. Instead, we suggest that this part should be classified as “protein generator” and that detailed information about its specific function should be provided on the part information page.

In conclusion, as mentioned above, we tried to reorganize and normalize the information about parts provided in the parts registry for 2010 in order to develop our algorithm for the BioGuide application. During this process, we encountered some inconsistencies and misuses in the terminology, as well as inadequacies in the information provided about parts. First, we claim that a standard nomenclature should be established for future use in the field of synthetic biology. Based on the information gathered according to the new nomenclature, a professional database should be constructed to address the needs of synthetic biology. This will enable easy global exchange and exhibition of information. Second, although there is enough information about parts in the parts registry database, this information urgently needs to be put in order. Furthermore, new experimental standards for the subtitle “working condition” should be introduced to groups in the part submission process. These experimental standards will be important because the experimental details currently provided about parts do not satisfy the needs of wet-lab biologists for the design and construction of new Biobricks.

Graphical Modeling for Bio-Guide

Introduction

Graphical modeling theory has been applied to construct four different graphs that present, for the iGEM parts registry database, the relations of atomic parts, devices and systems and the functional combinations that can build new constructs. Three graphs are composed of iGEM devices and one graph is based on Biobricks. Each graph comprises a set of vertices (nodes) and a set of edges. In the set of nodes each node represents a device, while in the set of edges each edge represents an input-output combination of two nodes. These graphs are directed graphs, as the edges are created according to input-output combinations. An edge is created for every compatibility between a regulator and a promoter, where the source of the edge is the device with the corresponding regulator and the target of the edge is the device with the promoter concerned.
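The edge rule described above can be sketched in a few lines. This is an assumed, simplified representation and not the BIO-Guide data model: the function `build_edges`, the dictionary keys `regulators` and `promoter_inputs`, and the device names D1 to D3 are hypothetical.

```python
def build_edges(devices):
    """Create a directed edge (source, target) whenever a regulator produced
    by the source device is one that the target device's promoter responds to.

    devices -- dict mapping a device name to
               {"regulators": set of produced regulators,
                "promoter_inputs": set of regulators its promoter responds to}
    """
    edges = set()
    for src, src_info in devices.items():
        for dst, dst_info in devices.items():
            if src_info["regulators"] & dst_info["promoter_inputs"]:
                edges.add((src, dst))
    return edges


# Hypothetical devices used only to exercise the rule.
devices = {
    "D1": {"regulators": {"LacI"}, "promoter_inputs": {"TetR"}},
    "D2": {"regulators": {"TetR"}, "promoter_inputs": {"LacI"}},
    "D3": {"regulators": {"cI"}, "promoter_inputs": {"cI"}},
}
graph_edges = build_edges(devices)
```

Because a device is also compared with itself, a device whose product regulates its own promoter yields an edge that starts and ends on the same node.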

Fig. 1: A node representing a device

Fig. 2: Arrow representing an edge between two nodes

The atomic structures used in our graphical model are shown in Figures 1 and 2. A node is represented by a solid circle, with the label of the device (its part/device ID according to iGEM standards) marked in the foreground. The blue arrows between nodes connect the related devices, representing the input-output connectivity. The end style of the arrow indicates the direction of the edge, as in Figure 2, where the node labeled BBa_S03520 is the source and BBa_JO9250 is the target.


Directivity

All four graphs built for BIO-Guide are directed graphs, so every edge must have a single source and a single target. No edge is bidirectional. In mathematical form this can be represented as follows.

If an edge e has node v as its source and node w as its target, then the edge can be expressed as

$e = (v, w)$

For a directed graph the combination (v, w) is totally different from (w, v). Therefore,

$(v, w) \neq (w, v) \quad \text{for } v \neq w$

The direction of the edges is represented by the arrows, as explained in Figure 2.


Connectivity

Nodes forming their own sub-graphs, disconnected from the rest of the nodes, have been identified, which showed us the presence of incompatibility between a few regulators and promoters of the devices. We have observed this disconnection in all four of our graphs. The basis of the disconnection is shown in Figure 3, where the two sub-graphs without any edge connecting them to the main graph are presented on the right-hand side of the diagram. These features classify our graphs as disconnected graphs [1].

Fig. 3: A zoomed in screenshot showing two sub-graphs within the disconnected graph.
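Disconnected sub-graphs of this kind can be detected by ignoring edge direction and grouping nodes that can reach each other. The sketch below is a generic illustration rather than the BIO-Guide implementation; the function `weak_components` and the node names are hypothetical, and every edge endpoint is assumed to appear in `nodes`.

```python
def weak_components(nodes, edges):
    """Group nodes into sub-graphs that are connected when edge direction
    is ignored (weakly connected components)."""
    neighbours = {n: set() for n in nodes}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for node in nodes:
        if node in seen:
            continue
        component = set()
        stack = [node]
        while stack:  # iterative depth-first traversal
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical graph: A, B, C form the main graph; D and E form a separate sub-graph.
components = weak_components(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("D", "E")],
)
```

A graph is disconnected in the sense used above whenever this grouping yields more than one component.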


"Semi-Simplicity"

A simple graph is a graph in which no more than one edge contains the same set of nodes. So, in a simple graph it is not possible to find more than one edge with the same source and the same target. Additionally, an edge with the same source and target, forming a loop, is not allowed. In synthetic biology, however, it is possible to construct a device consisting of devices or bio bricks of the same species or type. Accordingly, our graphs are simple graphs with the exception of possible self-loops, where an edge starts from and ends on the same node. Because of this permitted flexibility, we call our graphs "semi-simple".
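The "semi-simple" property can be checked mechanically on a list of directed edges. The following is a small sketch under our reading of the definition above (no repeated source-target pair, self-loops permitted); the function names and node labels are hypothetical.

```python
from collections import Counter


def is_semi_simple(edges):
    """True if no (source, target) pair occurs more than once.

    Self-loops such as ("A", "A") are allowed, which is what separates
    a "semi-simple" graph from a strictly simple one.
    """
    return all(count == 1 for count in Counter(edges).values())


def self_loops(edges):
    """Return the edges whose source and target are the same node."""
    return [edge for edge in edges if edge[0] == edge[1]]


ok = is_semi_simple([("A", "B"), ("B", "A"), ("A", "A")])
duplicated = is_semi_simple([("A", "B"), ("A", "B")])
```

Note that ("A", "B") and ("B", "A") count as different edges here, since the graphs are directed.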

For general information about graphs refer to:

[1] http://en.wikipedia.org/wiki/Graph_(mathematics)
