Homework 9: Protein Folding & Simulated Annealing
02-201: Programming for Scientists
Due: Thursday, April 14, 2016 at 11:59 PM


1. Set up

We're back to Go for this assignment.

1. Inside your src directory, create a directory called fold.
2. Copy the canvas.go file from HW4 into the fold directory. Also download the code.google.com directory from Piazza if it isn't already in your src directory.
3. Set your GOPATH environment variable to the location of your go directory (the one that contains src). On a Mac, use

    export GOPATH=/Users/pcompeau/Desktop/go

   where you replace the directory name after the = with the location of your own go directory. On Windows, use

    set GOPATH=C:\Users\pcompeau\Desktop\go

2. Assignment

2.1 Protein folding

Proteins are linear molecules that consist primarily of a chain of amino acids strung together. There are 20 commonly occurring amino acids in most organisms, so we can think of a protein as a string over a 20-symbol alphabet. This string is called the primary structure of the protein. Yet the function of a protein also depends on the three-dimensional structure it takes after it folds in on itself (called its tertiary structure). A fundamental (and difficult!) challenge in modern biology is therefore to predict a protein's three-dimensional structure from its primary structure. For example, here are two protein folds and their amino acid sequences:

[Figure: (a) 1dtk, (b) 1dtk]

1dtk  XAKYCKLPLRIGPCKRKIPSFYYKWKAKQCLPFDYSGCGGNANRFKTIEECRRTCVG-
5pti  RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

To begin modeling this biological problem computationally, we assume that a protein folds into its lowest-energy conformation. So if we assume that we have a function energy(S) that takes a structure S and evaluates its energy, we are looking for a structure S that minimizes energy(S). While a lot of progress has been made on this problem over the years (including by the winners of the 2013 Nobel Prize in Chemistry), the problem remains difficult; attempting to formulate a decision problem for whether a given amino acid sequence can fold into a structure with energy below some threshold leads to NP-complete problems everywhere we look.

2.2 The HP model

In order to make progress on several aspects of the protein folding process, a simplified model introduced by Dill [1] has been studied. This model is called the HP model because we simplify the 20-letter amino acid alphabet down to just two types of amino acids: H, which are hydrophobic (like to be away from water), and P, which are polar (don't mind being near water). The second simplification in our case is that proteins are modeled in 2-D instead of 3-D, and that they only fold on a square lattice. The following are examples of such HP protein folds, where the circles represent amino acids and the solid nodes are the H amino acids (the leftmost HP protein has the sequence HHPHPHHPHHPPHHH):

[Figure: a row of four example HP folds on a square lattice]

[1] Dill. Theory for the folding and stability of globular proteins. Biochemistry 24(6):1501-9, 1985.

Notice that each folded protein is represented by a walk through the lattice, in which case we can
represent such a walk by a sequence of three relative directions: right (r), left (l), and forward (f). The walk is described as though you were giving driving directions. For example, a walk that looks like the number 7 would be represented as r, r, f (we assume that initially the walker is facing up). Almost all walks can be described in this way; the only exception concerns the first point: in this scheme, we cannot represent a walk that looks like a u, because there is no way to specify "go backwards" at the first point. Fortunately, this does not matter, since we don't care about the orientation of the protein. For example, the leftmost structure in the row of 4 above is specified by l, l, r, f, r, f, f, r, f, f, r, r, f, l. This sequence of instructions is called the fold or the structure.

The last simplification we make relates to the energy function. When dealing with real proteins, the energy function is usually a combination of many physical energies. For the HP model, the energy function measures how buried the H amino acids are. Let the amino acid sequence be given by $p_1, p_2, \ldots, p_n$, where $p_i = 1$ if the $i$th amino acid is an H and $p_i = 0$ otherwise. Then the energy of a folded protein is

$$\mathrm{energy}(S, p) = 10x - \sum_{i=1}^{n} p_i s_i,$$

where $s_i$ equals the number of amino acids at neighboring lattice points, including diagonals, that are not adjacent to the $i$th point in the walk, and $x$ is the number of times the walk crosses itself.

[Figure: example lattice neighborhood of point i]

In this example, $s_i = 4$ because, of the 8 nearby points, 4 contain amino acids and are not adjacent to $i$ in the structure.

Our goal in this assignment will be to write a program that is given an HP protein sequence $p$ and finds a structure $S$ (a walk along a square lattice) that minimizes $\mathrm{energy}(S, p)$. An alternative problem formulation is to find a non-crossing structure $S$ that minimizes $\mathrm{energy}(S, p)$.

2.3 Simulated annealing

To find a low-energy structure, we will use simulated annealing, which is a widely useful optimization framework.
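Before the procedure itself, it may help to see the objective from Section 2.2 in code. The following Go sketch decodes a fold into lattice coordinates and evaluates the energy. The names point, positions, and energy are illustrative, not part of the assignment. The sketch assumes that the first residue sits at the origin, that each command turns the walker and then advances one step, and that x counts every residue placed on an already occupied lattice point; the handout does not pin down that last convention, so treat it as one reasonable reading.

```go
package main

import "fmt"

// point is a lattice coordinate (hypothetical helper type).
type point struct{ x, y int }

// positions decodes a fold over 'l', 'r', 'f' into lattice coordinates.
// Assumption: residue 1 is at the origin, the walker starts facing up,
// and every command first turns the walker and then moves it one step.
func positions(fold string) ([]point, error) {
	dx, dy := 0, 1 // facing up
	pts := []point{{0, 0}}
	for _, c := range fold {
		switch c {
		case 'l':
			dx, dy = -dy, dx // turn 90 degrees counterclockwise
		case 'r':
			dx, dy = dy, -dx // turn 90 degrees clockwise
		case 'f':
			// keep the current heading
		default:
			return nil, fmt.Errorf("invalid fold command %q", c)
		}
		last := pts[len(pts)-1]
		pts = append(pts, point{last.x + dx, last.y + dy})
	}
	return pts, nil
}

// energy computes 10x minus the sum of s_i over all H residues.
// Assumption: x is the number of residues that land on a lattice point
// that an earlier residue already occupies.
func energy(fold, seq string) (int, error) {
	pts, err := positions(fold)
	if err != nil {
		return 0, err
	}
	if len(pts) != len(seq) {
		return 0, fmt.Errorf("fold needs %d commands, got %d", len(seq)-1, len(fold))
	}
	occupied := make(map[point][]int) // residue indices at each lattice point
	crossings := 0
	for i, p := range pts {
		if len(occupied[p]) > 0 {
			crossings++
		}
		occupied[p] = append(occupied[p], i)
	}
	buried := 0
	for i, p := range pts {
		if seq[i] != 'H' {
			continue
		}
		for ox := -1; ox <= 1; ox++ {
			for oy := -1; oy <= 1; oy++ {
				if ox == 0 && oy == 0 {
					continue
				}
				for _, j := range occupied[point{p.x + ox, p.y + oy}] {
					if j != i-1 && j != i+1 { // skip neighbors along the walk
						buried++
					}
				}
			}
		}
	}
	return 10*crossings - buried, nil
}

func main() {
	pts, _ := positions("rrf")
	fmt.Println(pts) // [{0 0} {1 0} {1 -1} {1 -2}], the "7" shape
	e, _ := energy("rrr", "HHHH")
	fmt.Println(e) // -6: a 2x2 square of H residues
	e, _ = energy("rrrr", "PPPPP")
	fmt.Println(e) // 10: the walk returns to its starting point once
}
```

Storing residues in a map keyed by coordinate avoids having to size a grid in advance; a fixed 2-D array works just as well if you prefer that layout.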
Our simulated annealing framework depends on several parameters and works as follows:

1. Set $i = 1$; choose a random structure $S_i$ (i.e., a random sequence of l, r, f).
2. Compute $E_i = \mathrm{energy}(S_i, p)$.
3. Change a random letter of the walk $S_i$ to a random different letter to obtain a slightly different structure $S'_i$.
4. Compute $E'_i = \mathrm{energy}(S'_i, p)$.

5. If $E'_i < E_i$, set $S_{i+1} = S'_i$. Otherwise, if $E'_i > E_i$, let $q = e^{-(E'_i - E_i)/kT}$; with probability $q$ set $S_{i+1} = S'_i$, and with probability $(1 - q)$ set $S_{i+1} = S_i$.
6. Let $i = i + 1$, and if $i \bmod 100 = 0$ let $T = 0.999T$.
7. If the structure hasn't changed for $m$ iterations, stop; otherwise go to step 2.
8. Return the $S_j$ with the lowest energy.

In other words, we keep making small random changes to the structure. If a change improves the structure (lower energy), we keep it. If a change increases the energy, we keep it with probability $e^{-\Delta/kT}$, where $\Delta$ is the amount by which the energy went up. The idea here is that we always accept improvements, and we accept bad changes with a probability that decreases exponentially with how bad they are. The constant $k$ is a normalizing factor that scales the energy differences. The variable $T$ is the interesting one: when $T$ is large, we are more likely to accept changes that increase the energy, and when $T$ is very small, we almost never do. $T$ can be viewed as the temperature of the system: when the system is hot, we bounce around a lot, and as it cools we head towards low-energy solutions. Reasonable choices for the parameters for a protein of length $n$ are $m = 10n$, $k = 6n$, and an initial $T = 10n$.

2.4 What you should do

You should write a Go program that can be run with the following command line:

    ./fold HP

where HP is a string of Hs and Ps representing the HP protein sequence. Your program should output two lines of text:

    Energy: ENERGY
    Structure: lrf

where ENERGY is the energy (an integer) of the final structure, and lrf is a sequence of l, r, f that specifies the lowest-energy structure you found. You should also write a file called fold.png that draws the structure. You can use any nice format for the drawing that makes the Hs dark colored and the Ps light colored: for example, edges of length 10 pixels, with the H residues drawn as solid black squares and the P residues as solid white squares with a black border.
Drawing the final fold is extra credit for 02-201 students.

Note: You can use any algorithm you want to find a low-energy structure. Of course, simulated annealing is a good choice, but you can modify simulated annealing if you need to in order to obtain a better structure. You can also modify aspects of the basic simulated annealing algorithm above. For example, you can change how often T is changed (maybe even sometimes increasing it), or you could run several independent simulated annealing runs from different starting points and return the best solution found, etc.

2.5 Tips on how to start

First write the main function to parse the arguments from the command line. Then write a function randomfold that generates a random fold as a sequence of l, r, and f commands. Then write a function drawfold that draws the fold onto a sufficiently large 2-D matrix. Next, write and test a function energy(S, p) that computes the energy of a fold S and sequence p. This function probably calls drawfold. Next, write a randomfoldchange function that takes a fold and randomly changes one of its commands. Next, write optimizefold(p), which returns the lowest-energy fold you can find for p (likely by running the simulated annealing algorithm). Play around with the parameters and the algorithm to see if you can get better and better folds. Finally, write paintfold, which draws the fold onto a canvas and saves the canvas to fold.png.

2.5.1 Tip for getting your program to run faster

In the simulated annealing code, if you find a new structure with the same energy, don't switch to it.

2.5.2 Tips for making the code easier to write

For the drawing, you can use any format you want so long as it shows the structure in a reasonable way.

The data structure the solution uses for laying out the fold is [][]string. Then M[x][y] lists all the residues that are in that position (if crossings are allowed). If you'd prefer, you can simply reject structures that have crossings (i.e., treat them as having infinite energy).

Don't try to track which amino acids are adjacent to one another. Rather, lay them out in a 2-D array, compute the score ignoring which residues are adjacent in the walk, and then subtract 2 for every H in the middle of the sequence and 1 for every H at the ends of the sequence.

3. Learning outcomes

After completing this homework, you should have implemented a simulated annealing optimizer (or another optimization approach), learned about the HP protein folding model, and gained additional practice writing Go programs.
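For the first tip in Section 2.5, a possible skeleton for the command-line handling and the output format of Section 2.4 is sketched below. The helper names parseSequence and report are hypothetical, and the all-forward fold printed by main is only a placeholder where your optimizer's result would go. Falling back to the example sequence from Section 2.2 when no argument is given is a convenience of this sketch, not something the assignment asks for.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseSequence extracts and validates the HP string from the
// command-line arguments (args[0] is the program name).
func parseSequence(args []string) (string, error) {
	if len(args) != 2 {
		return "", fmt.Errorf("usage: fold HP")
	}
	seq := args[1]
	if seq == "" {
		return "", fmt.Errorf("empty sequence")
	}
	for i := 0; i < len(seq); i++ {
		if seq[i] != 'H' && seq[i] != 'P' {
			return "", fmt.Errorf("invalid residue %q at position %d", seq[i], i+1)
		}
	}
	return seq, nil
}

// report formats the two output lines described in Section 2.4.
func report(energy int, fold string) string {
	return fmt.Sprintf("Energy: %d\nStructure: %s\n", energy, fold)
}

func main() {
	args := os.Args
	if len(args) == 1 {
		// Demo fallback (an assumption of this sketch): use the example
		// sequence from Section 2.2 when no argument is supplied.
		args = append(args, "HHPHPHHPHHPPHHH")
	}
	seq, err := parseSequence(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Placeholder result: a straight, all-forward fold. No residue has a
	// non-adjacent neighbor on a straight line, so its energy is 0.
	fold := strings.Repeat("f", len(seq)-1)
	fmt.Print(report(0, fold))
}
```

Validating the sequence up front means the rest of the program can assume every character is H or P and that a sequence of n residues always pairs with a fold of n-1 commands.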