From 2010.igem.org

Revision as of 21:48, 26 October 2010

Modeling

Parameterization Concept

One of the hardest tasks in developing our models was to come up with a good strategy for generating input parameters from the raw data. In our case, the raw data are the binding site (BS) sequence and the corresponding sh/miRNA sequence. The final parameterization concept combines a basic distinction between perfect, bulged (near-perfect) and endogenous miRNA-like binding sites with the more advanced 3'-scoring and AU-content evaluation. The endogenous miRNA-like BS parameter is further split into the three types of seed binding sites.

Neural Network Model

Neural Network theory

An artificial neural network, usually just called a neural network (NN), is a computational model inspired by the biological nervous system. The network is composed of simple elements called artificial neurons, which are interconnected and operate in parallel. In most cases the NN is an adaptive system that can change its structure depending on the internal or external information that flows through the network during the learning process. The NN can be trained to perform a particular function by adjusting the values of the connections between the artificial neurons, called weights. Neural networks have been employed to perform complex functions in various fields, including pattern recognition, identification, classification, speech, vision, and control systems.

During the learning process, the difference between the desired output (target) and the network output is minimised. This difference is usually called the cost; the cost function measures how far the network output is from the desired value. A common cost function is the mean-squared error, and several algorithms can be used to minimise it. The following figure displays such a loop.
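The training loop described above can be sketched in a few lines. This is a minimal illustration only: it uses a single linear neuron and plain gradient descent rather than the algorithm used for our model, and the toy data, learning rate and the name `train_linear_neuron` are assumptions made for the example.

```python
import numpy as np

def train_linear_neuron(x, t, lr=0.5, epochs=500):
    """Fit y = w*x + b by gradient descent on the mean-squared error."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        y = w * x + b                        # network output
        err = y - t                          # difference to the target
        history.append(float(np.mean(err ** 2)))  # cost (MSE) before the update
        # Gradients of the MSE with respect to w and b
        w -= lr * 2.0 * np.mean(err * x)
        b -= lr * 2.0 * np.mean(err)
    return w, b, history

# Toy data (made up for illustration): targets follow t = 2x + 1
x = np.linspace(0.0, 1.0, 20)
t = 2.0 * x + 1.0
w, b, history = train_linear_neuron(x, t)
```

Each pass through the loop computes the output, compares it with the target, and adjusts the weight and bias so that the cost decreases.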

Figure 2: Training of a Neural Network.

Model description

Input/target pairs

The NN model was created with the MATLAB NN toolbox. The input/target pairs used to train the network comprise experimental and literature data (Bartel et al. 2007). The experimental data were obtained by measuring, via luciferase assay, the strength of the knockdown caused by the interaction between the shRNA and the binding site located in the 3' UTR of the luciferase gene. Nearly 30 rationally designed binding sites were tested, and the respective knockdown strength was calculated from these measurements (formula not shown).
Each input was represented by a four-element vector. Each element corresponded to a score value for a specific feature of the binding site. The four features used to describe the binding site were the seed type, the 3' pairing contribution, the AU content and the number of binding sites. Each input/target pair represented the relationship between a particular binding site and the corresponding percentage of knockdown. The NN was trained with a pool of 46 data points. Afterwards it was used to predict percentages of knockdown for given inputs. The predictions were then validated experimentally.
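As a rough sketch of what such a four-element input could look like, the snippet below packs the four features into a vector. The class name, the field names and the ordering of the elements are assumptions for illustration; the feature values are taken from the miR122_106 row of the training set overview further down. No target value is shown, since the measured knockdown percentages are not listed on this page.

```python
from dataclasses import dataclass

@dataclass
class BindingSite:
    """Hypothetical container for the four features of one binding site."""
    seed_type: int            # seed type score
    three_prime_score: float  # 3' pairing contribution
    au_score: float           # AU-content score
    n_sites: int              # number of binding sites

    def as_vector(self):
        # Element order is an assumption; the source does not specify it.
        return [float(self.seed_type), self.three_prime_score,
                self.au_score, float(self.n_sites)]

# Feature values of training set entry miR122_106
site = BindingSite(seed_type=3, three_prime_score=7.5, au_score=0.624, n_sites=1)
x = site.as_vector()
```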

Characteristic of the Network

The neural network comprised two layers (a multilayer feedforward network). The first layer was connected to the network input and comprised 15 artificial neurons. The second layer was connected to the first one and produced the output. A sigmoid activation function was used for the first layer and a linear activation function for the second. The algorithm used to minimize the cost function (sum squared error) was Bayesian regularization, which takes place within the Levenberg-Marquardt algorithm. The algorithm updates the weight and bias values according to Levenberg-Marquardt optimization and overcomes the problem of interpolating noisy data by applying a Bayesian framework to the NN learning problem (MacKay 1992).
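For orientation, the regularized objective in this kind of Bayesian framework is commonly written as a weighted sum of the data error and a penalty on the weights. The symbols below (F, E_D, E_W, α, β) are not used elsewhere on this page and are introduced here only as an assumed, standard notation; N would be the 46 training samples.

```latex
F(\mathbf{w}) = \beta E_D + \alpha E_W ,
\qquad
E_D = \sum_{i=1}^{N} \left( t_i - y_i \right)^2 ,
\qquad
E_W = \sum_{j} w_j^2
```

Here t_i are the targets, y_i the network outputs and w_j the weights and biases. Penalizing large weights through E_W is what keeps the network from fitting the noise in the data.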

Figure 3: Schematic illustration of the network components. "Hidden" represents the first layer, which comprises 15 artificial neurons, while "Output" is the second and last layer, which produces the output. The symbol "w" denotes the weights and "b" the biases.
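The architecture described above can be sketched as a single forward pass. This is an illustrative sketch and not the MATLAB model itself: the weights are random placeholders rather than trained values, the logistic function is assumed as the sigmoid (the exact sigmoid variant is not stated), and the names `forward`, `W1`, `b1`, `W2`, `b2` are made up for the example.

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid, assumed here as the hidden-layer activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward pass: sigmoid hidden layer, linear output layer."""
    hidden = logistic(W1 @ x + b1)   # 15 hidden neurons
    return W2 @ hidden + b2          # linear output neuron

rng = np.random.default_rng(0)
# Random placeholder parameters (NOT the trained weights and biases)
W1 = rng.normal(size=(15, 4))   # weights: 4 inputs -> 15 hidden neurons
b1 = rng.normal(size=15)        # hidden-layer biases
W2 = rng.normal(size=(1, 15))   # weights: 15 hidden neurons -> 1 output
b2 = rng.normal(size=1)         # output bias

x = np.array([3.0, 7.5, 0.624, 1.0])  # one four-element input vector
y = forward(x, W1, b1, W2, b2)
```

Training then amounts to adjusting `W1`, `b1`, `W2` and `b2` until the outputs match the targets as closely as the regularization allows.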

Results

Training the Neural Network

The network was trained with 46 samples. The regression between the NN outputs and the targets gave a correlation coefficient of R = 0.9864.
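A correlation coefficient of this kind can be computed as sketched below. The numbers in the example are made up for illustration and are not our training data; `regression_r` is a hypothetical helper name.

```python
import numpy as np

def regression_r(outputs, targets):
    """Pearson correlation coefficient between network outputs and targets."""
    return float(np.corrcoef(outputs, targets)[0, 1])

# Made-up knockdown percentages, for illustration only
targets = np.array([10.0, 35.0, 60.0, 85.0])
outputs = np.array([12.0, 33.0, 62.0, 83.0])
r = regression_r(outputs, targets)
```

A value of R close to 1 means the outputs follow the targets almost linearly.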

Figure 4: Regression line showing the correlation between the NN output and the respective target value.

Simulation and experimental verification

Fuzzy Inference Model

Why use a fuzzy inference system to model binding site efficiency?

To evaluate the complex features of an shRNA or miRNA binding site and predict the resulting knockdown percentage of the protein, we developed a fuzzy inference system (FIS). The parameterized properties of the binding sites serve as input and are processed into the knockdown percentage as the single output. Our system is therefore a multiple input, single output (MISO) fuzzy inference system. Fuzzy logic (FL) is a rule-based method of approximate artificial reasoning developed by Lotfi Zadeh in 1965. It is motivated by the observation that humans often think and communicate in a vague way, and yet can make precise decisions [Nelles O. Nonlinear System Identification Springer Verlag GmbH & Co., Berlin, 2000.]. It has been widely used in engineering and artificial intelligence, for example in fuzzy controllers and fuzzy expert systems. Fuzzy logic has also been used for the modeling of biological pathways [Bosl W. J. Systems biology by the rules: hybrid intelligent systems for pathway modeling and discovery. BMC Systems Biology 1:13 (2007).] and to analyze gene regulatory networks [Laschov D., Margaliot M. Mathematical modeling of the lambda switch: a fuzzy logic approach. J Theor Biol. 21:475-89 (2009)]. Key advantages of fuzzy logic-based approaches are that (i) models can be constructed from prior knowledge of the system and from experimental data, (ii) intermediate states can be encoded for inputs and outputs, which improves on logic approaches such as Boolean models that can only deal with ON/OFF states [Aldridge B. B., Saez-Rodriguez J., Muhlich J. L., Sorger P. K., Lauffenburger D. A. Fuzzy logic analysis of kinase pathway crosstalk in TNF/EGF/insulin-induced signaling PLoS Comput Biol. 5:e1000340 (2009).], and (iii) simulations can be derived from both qualitative and quantitative data, both of which can be cast into the form of IF-THEN rules. Thus, FL constitutes a powerful approach for understanding heterogeneous datasets.
Fuzzy inference systems are based on membership functions (MF). An MF rates, on a scale from 0 to 1, how well an input parameter satisfies a criterion. One input parameter can be rated against a single criterion or against several criteria, each represented by its own membership function. The height of a person, for example, can be evaluated with one MF that rates how well the person satisfies being tall. Alternatively, there could be three MFs, one evaluating the membership to small people, the second to medium-sized people and the third to big people (Figure MembershipFunction1.png). For a person's height of 1.8 meters, the MF "big" would be satisfied to about 0.6 (Figure MembershipFunctionBig.png). In this way, every input is converted to membership values between 0 and 1. Changing the shape of the MF makes it possible to have either functional dependencies, which allow intermediate membership values, or simple ON/OFF states, where the membership value can only be 0 or 1 (Figure MembershipONOFF.png). Different kinds of input parameters can therefore be evaluated with a fuzzy inference system. In the simple height example, the age of the person could be taken as a second input and evaluated by an MF that is 0 up to the age of 18 and 1 for older persons. The model would then differentiate between young and grown-up persons.
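The two kinds of membership functions from the height and age example can be sketched as follows. The breakpoints 1.5 m and 2.0 m are assumptions chosen so that a height of 1.8 m gives a membership of about 0.6, as in the example; the real shape of the MF in the figure may differ, and the function names are hypothetical.

```python
def ramp_mf(x, lo, hi):
    """Membership rising linearly from 0 at `lo` to 1 at `hi` (intermediate states)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def step_mf(x, threshold):
    """ON/OFF membership: 0 below the threshold, 1 from the threshold upwards."""
    return 1.0 if x >= threshold else 0.0

# "big" membership of a 1.8 m tall person (breakpoints are assumed values)
big = ramp_mf(1.8, 1.5, 2.0)

# "grown-up" membership for a 30-year-old and a 17-year-old
grown_up = step_mf(30, 18)
young = step_mf(17, 18)
```

The ramp yields graded values between 0 and 1, whereas the step function only distinguishes the two states.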

Simple if-then rules can then be used to combine the input MFs into an output MF. How well an object (a set of input parameters) satisfies a rule is defined by its degree of membership to the different MFs. The better the rule is satisfied, the higher the membership to the output MF. The output MF can be a function like the input MFs. This is the case in Mamdani method fuzzy inference systems [Mamdani, E.H. and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," International Journal of Man-Machine Studies, Vol. 7, No. 1, pp. 1-13, 1975.]. We are using a Sugeno method fuzzy inference system [Sugeno, M., Industrial applications of fuzzy control, Elsevier Science Pub. Co., 1985.], where the output MF is either a constant or a linear function of the input parameters. The advantage of a Sugeno fuzzy inference system is that it is computationally more efficient and easier to optimize or adapt, owing to its simpler output MFs. Because the 3'-pairing and AU-content scores combine in a non-intuitive way, our fuzzy inference system needs to be optimized computationally.
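A minimal sketch of how a Sugeno system with constant output MFs turns rule satisfaction into one output value is given below. It uses the common weighted-average form; the rule strengths and the constant outputs are made-up numbers, not the values of our model, and `sugeno_output` is a hypothetical name.

```python
def sugeno_output(firing_strengths, rule_outputs):
    """Zero-order Sugeno inference: average of the constant rule outputs,
    weighted by how strongly each rule is satisfied."""
    total = sum(firing_strengths)
    if total == 0:
        raise ValueError("no rule is satisfied by this input")
    weighted = sum(w * z for w, z in zip(firing_strengths, rule_outputs))
    return weighted / total

# Two hypothetical rules: satisfied to 0.75 and 0.25,
# with constant outputs of 80 % and 20 % knockdown (made-up values)
knockdown = sugeno_output([0.75, 0.25], [80.0, 20.0])
```

Because each output MF is just a constant, optimizing the system largely reduces to tuning these constants and the shapes of the input MFs.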

How is our fuzzy inference system optimized? MISO Sugeno Fuzzy Network Model

Optimizable

Extendable

Fuzzy Model Concepts

Bulged binding sites concept: This model concept evaluates bulged ("near-perfect") binding sites separately from conventional seed + 3'-pairing binding sites. Rule number 2 considers the bulge size of the bulged binding site.

Bulged binding sites (including AU-content score) concept: This concept extends the bulged-BS concept by also evaluating the AU-content score. Rule number 2 was modified accordingly.

Consider low 3' score concept: This model concept takes into account that binding sites with a 3' score below 3 did not show a significant change in knockdown efficiency compared to a control with only seed pairing (Grimson et al., 2007). This is realized by rule number 6.

Strength: general prediction, no dependency on conditions. Assured by [normalization strategy]

based on previous knowledge [Bartel]

Our fuzzy inference system can deal with 3 different kinds of shRNA binding sites. Perfect, bulged and endogenous-like binding sites are treated separately, due to the differences in their biological mechanism, as discussed earlier [link to binding site properties]. A perfect binding site is evaluated by a simple ON/OFF input MF evaluating the boolean input of

We came up with different concepts of which input parameters to integrate into the fuzzy inference model and how to evaluate them. To this end, we parameterized the properties of a large set of binding sites according to various BS characteristics. The targetscan_50_context_scores algorithm (Rodriguez et al., 2007), which evaluates binding sites with respect to 3'-pairing and AU-content, outputs a score that seems appropriate for distinguishing between binding sites, especially endogenous miRNA-like ones. A more detailed description of the binding site parameterization concept can be found under Model Training Set.

Input parameters

Input membership functions

Output membership functions

Rules

Optimization

Parameters and their functionality

Output Membership function values

7merA1

7merM8

8mer

(Nearperfect)

(Perfect)

Fuzzy Model Optimization

Result

Click here if you are interested in more recent model optimization results!

Training Set Overview

mi/shRNA name	sequence	BS sequence	number of BS	perfect	bulged	bulge size	seed type	3' score	AU score
miR122_102	TGGAGTGTGACAATGGTGTTTGT	GACAAACACCATTGTCACACTCCA	1	1	0	0	0	0	0
miR122_106	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATGAAGACACTCCA	1	0	1	4	3	7.5	0.624
miR122_134	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATACGGACACTCCAGAGACACAAACACCATGAAGACACTCCA	2	0	1	4	3	7.5	0.576
miR122_136	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATACGGACACTCCA	1	0	1	4	3	7.5	0.595
miR122_197	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATGTCGACACTCCA	1	0	1	4	3	7.5	0.597
miR122_199	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATGCCAACACTCCA	1	0	1	4	3	7.5	0.603
miR122_201	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATACGAACACTCCA	1	0	1	4	3	7.5	0.624
miR122_203	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATGCAGACACTCCA	1	0	1	4	3	7.5	0.6
miR122_277	TGGAGTGTGACAATGGTGTTTGT	ACAAACACCATGCCTACACTCCA	1	0	1	4	3	7.5	0.603
miR122_138	TGGAGTGTGACAATGGTGTTTGT	GGCCAGCACCATTTCACACACACTCCTTCTAGAGGCCGCTGGC	1	0	0	0	2	5	0.336
miR122_140	TGGAGTGTGACAATGGTGTTTGT	GCCCCTGATGGGGGCGACACTCCATCTAGAGGCCGCTGGC	1	0	0	0	3	1.5	0.327
miR122_142	TGGAGTGTGACAATGGTGTTTGT	GACTAAGGCTGCTCCATCAACACTCCATCTAGAGGCCGCTGGC	1	0	0	0	3	4	0.314
miR122_144	TGGAGTGTGACAATGGTGTTTGT	GCAATGGAGAGTCACCTAGACACTCCATCTAGAGGCCGCTGGC	1	0	0	0	3	2.5	0.314
miR122_146	TGGAGTGTGACAATGGTGTTTGT	GACTTGAGCAGAACAAACACTCCATCTAGAGGCCGCTGGC	1	0	0	0	3	2	0.327
miR122_148	TGGAGTGTGACAATGGTGTTTGT	GCAAATCATGATCAAAAACACTCCCTCTAGAGGCCGCTGGC	1	0	0	0	2	2.5	0.221
sAg_19_bs_r10_12_acg_fw	GAACAAATGGCACTAGTAA	TTACTAGACGCATTTGTTC	1	0	1	3	2	5.5	0.442
sAg_19_bs_r10_12_taa_fw	GAACAAATGGCACTAGTAA	TTACTAGTAACATTTGTTC	1	0	1	2	2	6	0.492
sAg_19_bs_r9_12_acgg_fw	GAACAAATGGCACTAGTAA	TTACTAGACGGATTTGTTC	1	0	1	4	2	5.5	0.442
sAg_19_bs_r9_12_atgt_fw	GAACAAATGGCACTAGTAA	TTACTAGATGTATTTGTTC	1	0	1	4	2	5.5	0.495
sAg_19_bs_m10cg_fw	GAACAAATGGCACTAGTAA	TTACTAGTGGCATTTGTTC	1	0	1	1	2	6.5	0.442
sAg_19_bs_m10ca_fw	GAACAAATGGCACTAGTAA	TTACTAGTGACATTTGTTC	1	0	1	1	2	6.5	0.496
sAg_19_bs_m11gc_fw	GAACAAATGGCACTAGTAA	TTACTAGTCCCATTTGTTC	1	0	1	1	2	6	0.442
sAg_19_bs_m11ga_fw	GAACAAATGGCACTAGTAA	TTACTAGTACCATTTGTTC	1	0	1	1	2	6	0.466
sAg_19_bs_onlyseed_fw_E	GAACAAATGGCACTAGTAA	AATGATCACGGATTTGTTC	1	0	0	0	2	0	0.442
sAg_19_bs_p_fw_E	GAACAAATGGCACTAGTAA	TTACTAGTGCCATTTGTTC	1	1	0	0	0	0	0
sag_25_1	GAACAAATGGCACTAGTAAACTGAG	ATAATTTGTTCATTTGTTC	2	0	0	0	2	1.5	0.491
sag_25_2	GAACAAATGGCACTAGTAAACTGAG	ATAATTTGTTCATTTGTTCATTTGTTC	3	0	0	0	2	1.5	0.491
sag_25_3	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGTGCCATTTGTTCAAAUAUAGCC	1	1	0	0	0	0	0
sag_25_4	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGTGCAATTTGTTAAAAUUUAGCC	1	0	1	1	3	8.5	0.587
sag_25_5	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGTGAAATTTGTTAAAAUUUAGCC	1	0	1	2	3	8	0.587
sag_25_6	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGTAAAATTTGTTAAAAUUUAGCC	1	0	1	3	3	7.5	0.587
sag_25_7	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGAAAAATTTGTTAAAAUUUAGCC	1	0	1	4	3	7	0.587
sag_25_8	GAACAAATGGCACTAGTAAACTGAG	CTGGGCAATTATAAATTTGTTAAAAUUUAGCC	1	0	0	0	3	2	0.603
sag_25_9	GAACAAATGGCACTAGTAAACTGAG	CTGGGCAGCCGCAAATTTGTTAAAGGCCCGCC	1	0	0	0	3	2	0.305
sag_25_10	GAACAAATGGCACTAGTAAACTGAG	CTGGGCAGCTATAATTTTGTTAAAAUUUAGCC	1	0	0	0	1	2.5	0.69
sag_25_11	GAACAAATGGCACTAGTAAACTGAG	AGTTTACGCCGTAAATTTGTTGAAGGCCCGCC	1	0	0	0	2	4	0.226
sag_25_12	GAACAAATGGCACTAGTAAACTGAG	CTGGGCAATTATAAATTTGTTGAAAUUUAGCC	1	0	0	0	2	2	0.526
sag_25_13	GAACAAATGGCACTAGTAAACTGAG	TCCTTACTAGTGCAATTTGTTAAAGGCCCGCC	1	0	0	0	3	7	0.305
sag_25_14	GAACAAATGGCACTAGTAAACTGAG	CTGAATATAGTGAAATTTGTTAAAAUUUAGCC	1	0	0	0	3	4	0.603
sag_25_15	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTACCTAATTTTGTTAAAAUCCGGCC	1	0	0	0	1	4	0.497
sag_25_16	GAACAAATGGCACTAGTAAACTGAG	CTGGGCCTAGTGGATTTTGTTAAAGGCCCGCC	1	0	0	0	1	5	0.366
sag_25_17	GAACAAATGGCACTAGTAAACTGAG	AGTTTACATTGCAAATTTGTTGAAGGCCCGCC	1	0	0	0	2	4	0.226
sag_25_18	GAACAAATGGCACTAGTAAACTGAG	CTGGGACTAGTGCAATTTGTTGAAGGCCCGCC	1	0	0	0	2	6	0.226
sag_25_20	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGAAAATTTTGTTAAAAUUUAGCC	1	0	0	0	1	7	0.69
sag_25_21	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGAAAAATTTGTTGAAAUUUAGCC	1	0	0	0	2	7	0.526
sag_25_23	GAACAAATGGCACTAGTAAACTGAG	CTGGGCATAGATAATTTTGTTAAAAUUUAGCC	1	0	0	0	1	3	0.69
sag_25_24	GAACAAATGGCACTAGTAAACTGAG	CTGGTACTAGCTAATTTTGTTAAAAUCCGGCC	1	0	0	0	1	5	0.497
sag_25_25	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTAGCCGATTTTGTTAAAGGCCCGCC	1	0	0	0	1	7	0.366
sag_25_26	GAACAAATGGCACTAGTAAACTGAG	CTGGGCCTAGAAAAATTTGTTGAAAUCCGGCC	1	0	0	0	2	4	0.348
sag_25_27	GAACAAATGGCACTAGTAAACTGAG	CTTTTACTAGAAAAATTTGTTGAAUUUAGCC	1	0	0	0	2	6	0.51
sag_25_28	GAACAAATGGCACTAGTAAACTGAG	CTGGTACTAGGCAAATTTGTTGAAGGCCCGCC	1	0	0	0	2	5	0.226
sag_25_29	GAACAAATGGCACTAGTAAACTGAG	GCTTTACTAGAAAAATTTGTTAAAAUUUAGCC	1	0	0	0	3	6	0.603
sag_25_30	GAACAAATGGCACTAGTAAACTGAG	AGTTTACTTTAAAAATTTGTTAAAAUUUAGCC	1	0	0	0	3	5	0.603
haat_bs_p_fw	AAACATGCCTAAACGCTTC	GAAGCGTTTAGGCATGTTT	1	1	0	0	0	0	0
haat_bs_r10_12_aat_fw	AAACATGCCTAAACGCTTC	GAAGCGTAATGGCATGTTT	1	0	1	3	2	5.5	0.799
haat_bs_r10_12_agc_fw	AAACATGCCTAAACGCTTC	GAAGCGTAGCGGCATGTTT	1	0	1	3	2	5.5	0.749
haat_bs_m10at_fw	AAACATGCCTAAACGCTTC	GAAGCGTTTTGGCATGTTT	1	0	1	1	2	6.5	0.799
haat_bs_m10ac_fw	AAACATGCCTAAACGCTTC	GAAGCGTTTCGGCATGTTT	1	0	1	1	2	6.5	0.773
haat_bs_m11ta_fw	AAACATGCCTAAACGCTTC	GAAGCGTTAAGGCATGTTT	1	0	1	1	2	1.5	0.38
haat_bs_m11tg_fw	AAACATGCCTAAACGCTTC	GAAGCGTTGAGGCATGTTT	1	0	1	1	2	1.5	0.38
haat_bs_r9_12_aatc_fw	AAACATGCCTAAACGCTTC	GAAGCGTAATCGCATGTTT	1	0	1	4	2	1.5	0.38
haat_bs_r9_12_agcc_fw	AAACATGCCTAAACGCTTC	GAAGCGTAGCCGCATGTTT	1	0	1	4	2	1.5	0.38
haat_bs_onlyseed_fw	AAACATGCCTAAACGCTTC	CTTCGCAAATCGCATGTTT	1	0	1	0	2	2	0.799
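
The rows of the training set can be read into records and sorted into the three kinds of binding sites that the fuzzy model treats separately. The snippet below is a sketch under assumptions: the columns are taken to be tab-separated in the order of the header row, the names `SiteRecord`, `parse_row` and `site_class` are hypothetical, and treating every row with neither flag set as "endogenous-like" is our reading of the table rather than a documented rule.

```python
from typing import NamedTuple

class SiteRecord(NamedTuple):
    name: str
    n_sites: int
    perfect: int
    bulged: int
    bulge_size: int
    seed_type: int
    three_prime_score: float
    au_score: float

def parse_row(line: str) -> SiteRecord:
    """Parse one tab-separated training set row (sequence columns are skipped)."""
    f = line.rstrip("\n").split("\t")
    return SiteRecord(f[0], int(f[3]), int(f[4]), int(f[5]),
                      int(f[6]), int(f[7]), float(f[8]), float(f[9]))

def site_class(rec: SiteRecord) -> str:
    """Assumed mapping of the perfect/bulged flags to the three BS kinds."""
    if rec.perfect == 1:
        return "perfect"
    if rec.bulged == 1:
        return "bulged"
    return "endogenous-like"

# One row copied from the table above
row = "sAg_19_bs_m10cg_fw\tGAACAAATGGCACTAGTAA\tTTACTAGTGGCATTTGTTC\t1\t0\t1\t1\t2\t6.5\t0.442"
rec = parse_row(row)
kind = site_class(rec)
```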

Team:Heidelberg/Modeling/descriptions