
Lecture on Practical Deep Learning
Statistical Physics Winter School, Pohang, January 2018
Prof. Kang-Hun Ahn (ahnkanghun@gmail.com, http://deephearing.org)

Basic of Python / Numerical Methods / TensorFlow / Convolutional Neural Networks / Generative Adversarial Networks

Thanks to Hyun Jae Kim, Maruchan Park

1. Basic of Python

Fortran and C have traditionally been used for computation in physics. Python, a popular language for artificial-intelligence-related programs, can also perform computer calculations for physics. Before I show you how to solve physics problems with Python, you can study the Python language and solve some examples.

Python is an easy language, developed by Guido van Rossum in the 90's, in which programs can be written quickly. Somewhat subjectively, I think Fortran is very easy to learn, but these days everyone who learns new languages is overwhelmed by the view that Python is the easy one. In this lesson I will explain Python in the context of Linux. Anyone who uses MS Windows can use Linux in a virtual machine.

There are various modules in Python, such as numpy, tensorflow, pytorch, scipy, and so on. You can import them like so:

import numpy as np

After importing, you can use the various functions in numpy; in this case, attach np. in front of the function name. In Korea, tensorflow is the most popular AI development tool, but pytorch has grown rapidly in recent years. After installing Anaconda, which helps with installing modules, create an environment that includes some new tools by typing the following command in Linux:

>>conda create --name torch2 python=3 pytorch numpy

This creates an environment named torch2, in which pytorch, python version 3, and numpy are installed on your computer and can be imported from a program.
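Coming back to the numpy import convention above, a minimal sketch; the function calls below are standard NumPy, chosen by me purely for illustration:

import numpy as np

x = np.linspace(0.0, np.pi, 5)   # five evenly spaced points between 0 and pi
print(np.sin(x))                 # element-wise sine, called with the np. prefix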

The extension of a Python code file is .py, and you can run it as

>>python filename.py

Now let's go into the grammar of Python. To me, every computer language has the following elements, and it is essential to learn them:

1) Loop
2) Conditional statement
3) Substructure
4) Array

Let's start with the loop.

1) Loop

counter=0
while (counter < 5):
    print(counter)
    counter=counter+1

In this case, while the condition counter < 5 is satisfied, the instruction is executed repeatedly, and the program outputs 0, 1, 2, 3, 4. Unlike in other languages, statements placed under a conditional statement in Python must be indented (with tabs or spaces). You should also write : at the end of the first line of the loop.

for count in range(3):
    print(count)

In this case the program prints 0, 1, 2.

2) Conditional statements

The following example demonstrates how they work.

if (name == "kanghun"):
    print("nice")
elif (name == "devil"):
    print("bad")
else:
    print("idontknow")

It looks like no explanation is needed. elif is used to add a condition, and else refers to all other cases. Do not forget to indent the body of the conditional statement.

3) Substructure

Python has a format including the Class, which contains functions and variables. Before we learn about classes, I will introduce functions first.

def ww():
    print("aaa")

The above code defines the function ww. Again there are : and indentation. You must include parentheses even if no parameters are given to the function. If you type ww(), then aaa is printed. No ":" is required for the call.

Let's talk about the Class. Again, a Class is a collection of functions and variables. If your code performs only one task, you may not need classes. But we usually build on code created by others, and even combine it with what we have created previously. A class is useful because it creates new code by combining different functions; classes are constructed for combining various pieces of code. Consider the following code:

class staff:
    def __init__(self, bonus):
        self.bonus = bonus
    def salary(self):
        salary = 10000 + self.bonus
        return salary

First we defined a class called staff, with a :. Parentheses are not written after the class name here. A class shows its contents using indentation. I have created two functions in the staff class: __init__ and salary. salary is a name I chose, but __init__ is a special function name built into Python; its feature is that it runs automatically when an instance of the class is made. Be careful with self: variables prefixed with "self." are shared within the class, that is, "self." indicates in which scope the variable is used. Writing a class is like writing a plan to execute commands in the future, and making an instance gets it ready to run. In the above case, let's make an instance ahn:

ahn=staff(10000)

At this time, the instance is created with the value 10000 stored in self.bonus. If you want to call the salary function in the staff class, you can type

aa=ahn.salary()
print(aa)

Then it prints 20000. Be sure to include the parentheses with salary().
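Putting the pieces of this example together, a minimal runnable sketch of the same class (the expected output is shown as a comment):

class staff:
    def __init__(self, bonus):
        self.bonus = bonus          # stored on the instance via self.
    def salary(self):
        salary = 10000 + self.bonus
        return salary

ahn = staff(10000)   # __init__ runs automatically here
aa = ahn.salary()
print(aa)            # prints 20000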

4) Array

If we declare np.ones((3,2)), we get a 3 by 2 matrix of all 1s:

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

Note the double square brackets [[ ]] in the array. This is because the matrix is an array of rank 2; as you'll see later, the number of square brackets increases when dealing with higher-rank arrays. np.zeros((3,2)) is likewise a 3 by 2 matrix containing 0s. When we want to specify the components of the array, we can use np.array. If we declare dat1 = np.array([[1,2,3],[4,5,6],[7,8,9]]), the array dat1 becomes

array([[1,2,3],
       [4,5,6],
       [7,8,9]]).

If you want to know the size of the matrix you created, type dat1.shape; it returns the size information, in this case (3,3). Note that the shape of an array can differ even with the same components:

a1 = np.array( [1, 2, 3] )          # shape (3,), a 1-dimensional array
a2 = np.array( [ [1, 2, 3] ] )      # shape (1,3), a 2-dimensional array
a3 = np.array( [ [1], [2], [3] ] )  # shape (3,1), a 2-dimensional array

In the case of a3, if we display it, it is

array([[1],
       [2],
       [3]]).

There is also the list, which can hold the same contents as the array a1 but is a different type from an array. Lists are very useful and often used: write a1=[1,2,3], without np.array. In this case a1[0]=1, a1[1]=2, a1[2]=3. In Python, indexing starts from 0, as in C, so a1[2] contains the last element, 3, of a1. You can also put lists inside any list. For example, if a=[1,2,3,["a","b"]], then a[3] contains ["a","b"], and so does a[-1]. So, what does a[-1][1] contain? The answer is "b". The list type has many features, such as slicing, concatenation, insertion, removal, finding positions, and so on. Here are a few basic examples. When a=[1,2,3],

a.append(4)

results in a=[1,2,3,4]. With a.append([5,6]), a new list can be attached as an element of the original list. Again, when a=[1,2,3] and a.reverse() is performed, you will get a=[3,2,1]. When a=[0]*10, a=[0,0,0,0,0,0,0,0,0,0]. In addition, if you write the following (in Python 3),

# File I/O
f=open("text.txt", "w")
print(type(f))
for i in range(1,10):
    f.write("%d th line.\n" % i)
f.close()

f=open("text.txt", "r")
for i in f:
    print(i)

you will see the following output:

<class '_io.TextIOWrapper'>

1 th line.
2 th line.
...
9 th line.

Here there are two f variables. The program first prints type(f), showing what class f is, and then uses f to read and print the file. The second f is different from the first, because the first was closed with f.close(). Note that the file goes up to the 9th line, not the 10th. In the second loop, i automatically refers to each line as a string. %d stands for an integer.

Ex) Integrate the sin(x) function from 0 to π (np.pi).

Ex) You can generate random numbers between 0 and 1 using np.random.random(). Assuming that x is a uniform random number between 0 and π, calculate the distribution of y = x*sin(x) by dividing the y values into bins of width 0.1, using 1000 x's.
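A hedged sketch of one way to attack these two exercises; the Monte Carlo approach below is my own suggestion, not spelled out in the lecture:

import numpy as np

# Ex 1: Monte Carlo estimate of the integral of sin(x) from 0 to pi (exact value: 2).
n = 100000
x = np.random.random(n) * np.pi        # uniform samples on [0, pi]
print(np.pi * np.mean(np.sin(x)))      # (b - a) times the mean value of the integrand

# Ex 2: distribution of y = x*sin(x) for 1000 uniform x in [0, pi], bin width 0.1.
x = np.random.random(1000) * np.pi
y = x * np.sin(x)
counts, edges = np.histogram(y, bins=np.arange(0.0, y.max() + 0.1, 0.1))
print(counts)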

2. Basic numerical analysis

1) Euler method

Many physics equations are ordinary differential equations (ODEs). Among them, second-order differential equations, especially when solving dynamics over time, are very common. A suitable method for solving such ODEs is the Euler method. It is said to make significant errors, but in practice this is not a problem: reducing the error by reducing the time step used to be a technical problem because of the high computational cost, which is no longer an issue nowadays since computational power has improved greatly. The Runge-Kutta method reduces errors better than the Euler method, but it is a bit more complicated, so it is usually sufficient to use the Euler method to write code and get results quickly.

Before I talk about the Euler method, we need to discuss a few basic things. Computers do not know physical dimensions; they only handle numbers. This is actually the mistake most beginners make. In physics, physical quantities are measurable quantities and are built from just three dimensions: length (L), time (T), and mass (M). You may have heard of many other physical quantities, but all of them can be described using L, T, and M. For example, even though the unit of energy is the Joule (J), it is defined as 1 kg · 1 m² / 1 s² and has physical dimension ML²T⁻². So you only need to set up three units on your computer: if you specify units for length, time, and mass, all other units are determined by these. If you set 1 nanometer = 1, 1 picosecond = 1, and 0.1 microgram = 1 on your computer, you cannot also set the energy unit to 1; the energy unit is automatically 0.1 microgram × (1 nanometer / 1 picosecond)² = 10⁻⁴ J. If a calculated quantity has the dimension of energy and the value 2, it should be interpreted as 2 × 10⁻⁴ J. When we use differential

equations, the coefficients of the equations should be made dimensionless. For example, in the case of a harmonic oscillator, the equation is

m ẍ + b ẋ + k x = F sin(ωt).

Here a dot over the variable means a derivative with respect to time t: two dots mean the second derivative, one dot the first derivative. Each term has the dimension of force, MLT⁻². First we guess what kind of motion to expect. If you roughly estimate the amplitude and the oscillation frequency of the harmonic oscillator and set the units accordingly, the numbers that appear on the computer are easy to handle. In the above equation, the oscillator is driven by an external force at angular frequency ω, so it can be assumed that the oscillator moves on a time scale of roughly 1/ω, and we can use t̃ = ωt as a new time variable. Alternatively, we can use the resonance angular frequency ω₀ = √(k/m) and set t̃ = ω₀t. Defining the dimensionless time t̃ = ω₀t and dividing each term by m,

ω₀² d²x/dt̃² + (b/m) ω₀ dx/dt̃ + ω₀² x = (F/m) sin((ω/ω₀) t̃).

Dividing by ω₀² again, this becomes

d²x/dt̃² + (b/(mω₀)) dx/dt̃ + x = (F/k) sin((ω/ω₀) t̃).

This equation has the dimension of length, and F/k represents the characteristic length of the problem. Therefore, if the unit of length is set to l₀ = F/k and the dimensionless length is defined as x̃ = x/l₀, the equation becomes

d²x̃/dt̃² + (b/(mω₀)) dx̃/dt̃ + x̃ = sin((ω/ω₀) t̃),

and now all coefficients are dimensionless. You can let the computer solve this equation and interpret the result using the defined units. As can be seen from the above equation, if the two dimensionless parameters b/(mω₀) and ω/ω₀ are the

same, then the solution x̃(t̃) is exactly the same. For a given set of dimensionless parameters you therefore only have to do the computation once. This is why, when analyzing dynamics, typical scientific papers show how the type of motion differs according to these dimensionless parameters.

Now let's assume we have made a dimensionless equation as above and omit the ~ sign. Suppose we have the following differential equation:

ẍ + a ẋ + f(x) = g(t).

Even if a second derivative appears, defining a new variable turns it into a system of first-order differential equations (and the same is true for higher derivatives):

dx/dt = v,
dv/dt = -a v - f(x) + g(t).

Note that there should be no derivatives on the right-hand side. The first line looks like the definition of the velocity, but note that v appears on the right. The derivatives are now treated as variables, named dxdt and dvdt in the code. If the time step is dt (usually 0.001), the Euler method is as follows:

x = initial value
v = initial value
for i = 1, ..., imax:
    t = i * dt
    dxdt = v
    dvdt = -a*v - f(x) + g(t)
    x = x + dxdt * dt
    v = v + dvdt * dt
    write (t, x)

This gives you the position x over time. The point of the Euler method is to update a single point according to the derivative value, as indicated in red in the figure.

[Figure: a single Euler update step along the local derivative]

As the figure shows, there is always a truncation error in the Euler method, because dt is not actually infinitely small. Several references give formulas that derive this error, but they are not actually needed: decrease dt by a factor of 10, increase imax by a factor of 10, plot the results, and check whether they agree. If they do, use it with no problem; if they differ a lot, you have to reduce dt further. How is this implemented in Python code?

import numpy as np
import matplotlib.pyplot as plt   # import pyplot from matplotlib under the name plt

def f(x):       # f and g are left unspecified in the original listing;
    return x    # simple choices added here for illustration
def g(t):
    return 0.0

x0=1.; v0=0.    # several assignments on one line are separated by semicolons
a=0.1
dt=0.01
x=x0; v=v0
lx=[]; lt=[]
for i in range(1000):
    time=dt*i

    lt.append(time)
    dxdt=v
    dvdt=-a*v - f(x) + g(time)
    x=x+dxdt*dt
    v=v+dvdt*dt
    lx.append(x)

plt.plot(lt,lx)
plt.show()

Ex) Using the Euler method above, calculate the motion of a harmonic oscillator without external force for several values of b, with m = 10 ng and k = 1 pN/nm. The initial position is 10 nm and the initial velocity is zero.
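A hedged sketch for this exercise. The unit choices below (length in nm, force in pN, mass in ng, so that the implied time unit is √(ng·nm/pN) ≈ 3.2×10⁻⁵ s) and the values of b are my own, for illustration only:

import numpy as np
import matplotlib.pyplot as plt

m, k = 10.0, 1.0          # 10 ng and 1 pN/nm in the chosen units
x0, v0 = 10.0, 0.0        # 10 nm initial position, zero initial velocity
dt, nsteps = 0.01, 5000

for b in [0.5, 2.0, 2.0*np.sqrt(m*k)]:   # two underdamped cases and the critically damped one
    x, v = x0, v0
    lt, lx = [], []
    for i in range(nsteps):
        dxdt = v
        dvdt = (-k*x - b*v) / m           # m x'' = -k x - b x', no external force
        x = x + dxdt*dt
        v = v + dvdt*dt
        lt.append(i*dt)
        lx.append(x)
    plt.plot(lt, lx, label="b = %.2f" % b)
plt.legend()
plt.show()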

2) Discrete Fourier transformation

The Fourier transform is an important concept with wide application. The discrete Fourier transform is

A_k = Σ_{m=0}^{n-1} a_m exp(-2πi mk/n),   k = 0, 1, ..., (n-1)/2.

This range of k is for odd n; when n is even, we use k up to n/2. Let a_m be time-series data and let the time interval be Δt (the time interval is the reciprocal of the sampling rate of the experimental equipment). Then, expressing the above Fourier transform in continuous variables, t = mΔt and ω_k = 2πk/(nΔt).

import numpy as np
import matplotlib.pyplot as plt

def discrete_fourier_transform(a):
    a = list(a)
    a_length = len(a)
    result = []
    for k in range(0, a_length):
        A_k = 0
        for m in range(0, a_length):
            i = (-2) * np.pi * 1j * m * k / a_length
            exp = np.exp(i)
            A_k = A_k + (a[m] * exp)
        result.append(A_k)
    return result

Ex) Fourier transform of exp(-0.01*m^0.1)*sin(0.6m), m = 1, 2, 3, ..., 1000.

import numpy as np
import matplotlib.pyplot as plt

lb=[]; lk=[]

def discrete_fourier_transform(a):
    a=list(a)
    a_length=len(a)
    result=[]
    for k in range(a_length):
        A_k=0
        for m in range(a_length):
            i=(-2)*np.pi*1j*m*k/a_length
            exp=np.exp(i)
            A_k=A_k+exp*a[m]
        result.append(A_k)
    return result

a=[]
a.append(0)
for m in range(1,1000):
    a.append(np.exp(-0.01*m**0.1)*np.sin(0.6*m))

b=discrete_fourier_transform(a)
for k in range(0,500):
    lb.append(np.real(b[k]))
    lk.append(k)
plt.plot(lk,lb)
plt.show()

Question) In the above example, computing k only up to 500 is sufficient. Why?

Fourier transforms often have to process large amounts of data repeatedly, so in practice the fast Fourier transform (FFT) is often used.
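NumPy has a built-in fast Fourier transform; a short sketch of checking the hand-written routine above against it (np.fft.fft uses the same sign convention as the formula above):

import numpy as np

m = np.arange(1000)
a = np.exp(-0.01*m**0.1) * np.sin(0.6*m)   # the same signal as in the example above
b_fft = np.fft.fft(a)                       # O(n log n) instead of O(n^2)
print(b_fft[:5])                            # compare with discrete_fourier_transform(a)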

Example) Find the motion of the damped harmonic oscillator with the Euler method, and then analyze the data using the Fourier transform.

3) Gradient descent method

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of an approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of the function; that procedure is then known as gradient ascent.

Ex) Gradient descent has problems with pathological functions such as the Rosenbrock function shown here:

f(x₁, x₂) = (1 - x₁)² + 100 (x₂ - x₁²)².

The Rosenbrock function has a narrow curved valley which contains the minimum. The bottom of the valley is very flat. Because of the curved, flat valley, the optimization zig-zags slowly with small step sizes towards the minimum.
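A minimal gradient-descent sketch on the Rosenbrock function above; the starting point, step size, and iteration count are my own choices:

def rosenbrock_grad(x1, x2):
    # gradient of f(x1,x2) = (1-x1)**2 + 100*(x2-x1**2)**2
    df_dx1 = -2.0*(1.0 - x1) - 400.0*x1*(x2 - x1**2)
    df_dx2 = 200.0*(x2 - x1**2)
    return df_dx1, df_dx2

x1, x2 = -1.2, 1.0      # a common starting point
eta = 5e-4              # step size
for i in range(50000):
    g1, g2 = rosenbrock_grad(x1, x2)
    x1, x2 = x1 - eta*g1, x2 - eta*g2   # step against the gradient
print(x1, x2)   # creeps toward the minimum at (1, 1); progress along the flat valley is very slow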

3. Neural Network

Artificial neural networks are computing systems inspired by the animal brain. Such systems learn by considering examples, generally without task-specific programming. An artificial neural network is based on a collection of connected units or nodes called artificial neurons (analogous to biological neurons in an animal brain). Each connection (analogous to a synapse) between artificial neurons can transmit a signal from one to another. The artificial neuron that receives the signal can process it and then signal the artificial neurons connected to it. The signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by a non-linear function of the sum of its inputs. Artificial neurons and connections typically have a weight that is adjusted as learning proceeds; the weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, the neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input) layer to the last (output) layer, possibly after traversing the layers multiple times.
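As a small, hedged illustration of the weighted-sum-and-threshold rule described above (all numbers are arbitrary):

import numpy as np

inputs = np.array([0.5, -1.0, 2.0])    # signals arriving from three connected neurons
weights = np.array([0.8, 0.2, 0.5])    # connection weights, adjusted during learning
threshold = 1.0

total = np.sum(weights * inputs)                 # aggregate input signal
output = 1.0 if total > threshold else 0.0       # the neuron fires only above threshold
print(total, output)                             # 1.2 1.0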


Here, σ is the sigmoid function, which makes the step function smooth. The nonlinear function used for the output of a neuron is called the activation function. These days, people commonly use the Rectified Linear Unit (ReLU) as the activation function because it has several advantages over the sigmoid.
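For concreteness, the standard definitions of the two activation functions just mentioned, written in NumPy:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))   # smooth version of the step function

def relu(z):
    return np.maximum(0.0, z)       # Rectified Linear Unit

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.1192...  0.5  0.8807...]
print(relu(z))      # [0.  0.  2.]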

Training is the process of adjusting the weight factors and thresholds so as to reduce the loss function (also called the cost function) shown above. There is a data set for training and a separate data set for testing whether the network actually works; typically the data are split roughly 8:2 between them. The loss function expresses the difference between the computed output and the expected output; one uses either the least-squares form, the sum of the squared differences (see the figure above), or the cross-entropy function

C = -(1/n) Σ_x [ y(x) ln a(x) + (1 - y(x)) ln(1 - a(x)) ],

where the sum runs over the training batch. When training a neural network, gradient descent can be used to make the loss function smaller; this requires the derivatives of the loss function with respect to the weighting factors and biases. These derivatives are easily obtained by the back-propagation method, but when using TensorFlow and similar tools you do not need to code this step yourself.

Adding hidden layers tends to help as the complexity of the problem increases (though not always). Looking at the XOR problem below, you can see that the problem becomes solvable precisely because a hidden layer is introduced. (See the lecture by Dr. Junghyo Jo [조정효].)

Example) Solve the XOR problem by using a neural network.
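A hedged sketch of the point: with one hidden layer the XOR function can be represented. The weights below are chosen by hand rather than trained, purely to show that the hidden layer makes the problem solvable:

import numpy as np

def step(z):
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: the first unit fires for (x1 OR x2), the second for (x1 AND x2).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X.dot(W1) + b1)

# Output unit: OR minus AND, which is exactly XOR.
W2 = np.array([1.0, -1.0])
b2 = -0.5
y = step(H.dot(W2) + b2)
print(y)    # [0. 1. 1. 0.]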

4. TensorFlow

TensorFlow is open-source software developed by Google Brain, a research organization of Google. It is designed for building AI programs, so it is well suited to constructing neural networks. A neural network can be represented as a graph, as shown in the figure: the circles are called neurons, and passing data from one to another is represented by arrows. In this graph the neurons form nodes connected by arrows, and each circle contains some sort of operation, including ones we will introduce later, such as Sigmoid or ReLU. In TensorFlow, the neural network to be calculated is first constructed as a graph. The graph is just a representation of a plan to perform some calculations, that is, a kind of code generation. When you run something called a Session, data are fed in and the actual calculation is performed. In this process the resources of the computer can be used in parallel. The arrows are represented by tensors, which is presumably where the name TensorFlow comes from (I am not 100% sure). Let's create a simple TensorFlow program that multiplies two numbers:

import tensorflow as tf

a=tf.placeholder("float")
b=tf.placeholder("float")
y=tf.multiply(a,b)

This code constructs a TensorFlow graph. At this point a has no value; instead, a is told to hold a place (it is a placeholder), and b is the same. The multiplication of the two placeholders is set up with the multiply command of tf. You do not need to specify a placeholder for y separately. The next steps prepare the calculation and run the session:

sess=tf.Session()
print(sess.run(y, feed_dict={a:3,b:3}))

If we regard the graph as an architectural blueprint, creating a session means hiring the construction workers and preparing the equipment. Above, sess is the name of the session, which is prepared for this purpose. You need sess.run to start it; the last line feeds in the input values and prints the output at the same time. Note that nothing is executed until sess.run appears, and at the moment the session is run, all variables in the graph are assigned proper values.

Ex) Single-layer neural network

import tensorflow as tf
import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder("float", [None,784])
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x,W)+b)
y_ = tf.placeholder("float", [None,10])

cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

This way, 100 pieces of data are randomly sampled at each step and used for training. Here, _xs means images and _ys means their labels. Now run the following code to check the test results:

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

28 In the first line, tf.argmax(y, 1) finds the largest value of y along axis = 1. The y matrix is None by 10 (= (None by 784) * (784 by 10)), where axis = 0 refers to the None side, row, while axis = 1 refers to the 10 side, column. "None" commonly means that any number is possible, and here, it means the number of input image data. Depending whether the maximum value of y and y_ in the first line is the same or not, it returns TRUE or FALSE as a shape of array. tf.cast will do this for [0,1,1,1,1,0,1,1... 1], by TRUE is 1 and FALSE is 0. Then, the accuracy percentage is obtained by tf.reduce_mean, which calculates the mean of the input array. Ref) MNIST data - The MNIST data-set is composed by a set of black and white images containing hand-written digits, containing more than examples for training a model, and for testing it. The MNIST data-set can be found at the MNIST database. - This data-set is ideal for most of the people who begin with pattern recognition on real examples without having to spend time on data preprocessing or formatting, two very important steps when dealing with images but expensive in time. - The images are centered in pixel frames by computing the mass center and moving it into the center of the frame. The images are like the ones shown here: - Also, the kind of learning required for this example is supervised learning; the images are labeled with the digit they represent. This is the most common form of Machine Learning.

- To download the data easily, you can use the script input_data.py, obtained from Google's site (and uploaded to the book's GitHub for your convenience). Simply download input_data.py into the same working directory where you are programming the neural network with TensorFlow. From your application you only need to import and use it in the following way:

import input_data
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

- After executing these two instructions you will have the full training data set in mnist.train and the test data set in mnist.test. Each element is composed of an image, referenced as xs, and its corresponding label ys, to make it easier to express the processing code. Remember that both data sets, training and testing, contain xs and ys; the training images are referenced in mnist.train.images and the training labels in mnist.train.labels.

5. Convolutional Neural Network (CNN)

The convolutional neural network (CNN) is an innovative neural network introduced in 1998 by Yann LeCun et al. It has led to dramatic improvements in automatic image processing, and it is now widely used in advanced machine-learning models. Let's look at how to implement a CNN through a simple example code.

import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
import tensorflow as tf

x = tf.placeholder("float", shape=[None, 784])
y_ = tf.placeholder("float", shape=[None, 10])
x_image = tf.reshape(x, [-1,28,28,1])

In the first two lines we load the MNIST data through TensorFlow. The placeholders literally hold a place, making room for tensors. tf.reshape changes the shape of the x tensor into the shape given in square brackets, where the first -1 means that the number is not specified, like None. The second and third numbers are the size (28x28) of the image data, and the last number is the number of input data channels; here it has to be 1 because MNIST

data are gray-scale images.

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

The code above defines helper functions for the CNN. The first two functions, weight_variable and bias_variable, make variables of a given shape (random and constant, respectively). conv2d is the function for a convolution layer, and max_pool_2x2 is the function for a pooling layer. A convolution layer combines the input data with the weight variables through element-wise multiplication and summation as the filter slides over the input. Then, in the pooling layer, features are extracted by taking the largest value in each filter region. These processes amount to sharing the weight variables and the bias of each filter across all hidden units. When a filter, convolutional or pooling, passes over the image, the distance it moves each time is called the stride. Padding means attaching zeros at the boundary of the input data so that all input values are treated evenly. In the code above, the convolution-layer stride is (1x1), the pooling-layer stride is (2x2), and the pooling filter size is (2x2); both layers use zero padding. See the figures below to understand how the filtering processes work.

[Figure: zero padding]
[Figure: pooling layer (max pooling)]

The figure above shows the whole process of how the data shapes change as the data pass through each layer. After the whole process, the input data are finally classified through a fully connected layer.

- Full code

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
import tensorflow as tf
import matplotlib.pyplot as plt

x = tf.placeholder("float", shape=[None, 784])
y_ = tf.placeholder("float", shape=[None, 10])
x_image = tf.reshape(x, [-1,28,28,1])
print("x_image=", x_image)

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # The leading 1 means one sample at a time; the middle two give a 1x1 stride,
    # i.e. the filter moves one step horizontally or vertically, so the convolution
    # does not change the image size. The last 1 is the single (gray-scale) channel.
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # Kernel size 2x2 and stride 2x2 (two steps in each direction), so the output
    # image shrinks by a factor of 1/2 x 1/2.
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5,5, 1, 32])
b_conv1 = bias_variable([32])
# 32 filters of size 5x5 that turn 1 input image into 32 output images.
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
# After the convolution the image size is still 28x28; after pooling it shrinks
# to 14x14. There are 32 such images.
print(x_image.get_shape())   # prints (?, 28, 28, 1): one 28x28 image
print(h_pool1.get_shape())   # prints (?, 14, 14, 32): 32 images of size 14x14

W_conv2 = weight_variable([5,5, 32, 64])
b_conv2 = bias_variable([64])
# 64 filters of size 5x5 scanning the 32 images; because 32 images are scanned,
# the input is really a 3-dimensional structure of size 14x14x32.
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
# After this there are 64 images of size 14x14.
h_pool2 = max_pool_2x2(h_conv2)
# The 2x2 pooling converts these to 7x7 images; 64 of them.
print(h_conv2.get_shape())   # (?, 14, 14, 64)
print(h_pool2.get_shape())   # (?, 7, 7, 64)

W_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])
# Preparation for the fully connected network.
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
# Flatten, keeping an axis for None (the batch) so that the shapes match.
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2)

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
train_step = tf.train.AdamOptimizer(0.0003).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

Acc_train = []
Acc_test = []
acc_te = 0
for i in range(3001):
    batch = mnist.train.next_batch(50)
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1]})
    # batch is [[data], [labels]]; the data is 50x784 and the labels are 50x10.
    if i % 10 == 0:
        acc_tr = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1]})
        acc_te = sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})
        print("step %d, training accuracy %g" % (i, acc_tr), "test accuracy %g" % acc_te)
        Acc_train.append(acc_tr)
        Acc_test.append(acc_te)

Project)
1. Run the MNIST code yourself and, by varying the number and size of the layers, obtain a high accuracy.
2. Collect images or data on your own and implement an interesting and useful classification task.

6. Generative Adversarial Network

A Generative Adversarial Net (GAN) consists of two models, a discriminator and a generator, which compete with each other to improve their performance. In a training sequence, the generator makes fake data from latent variables and the discriminator classifies the fake and real data. We therefore train the two networks by optimizing the loss function

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))],

where D and G are the discriminator and the generator, respectively, x is the real data, z are the latent variables, and D(x) is the probability that the input x is real data.
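A hedged transcription of V(D,G) into TensorFlow code. D_real and D_fake below stand for the discriminator outputs D(x) and D(G(z)); here they are plain placeholders only so that the loss definitions are runnable, whereas in a real GAN they come from the discriminator network:

import tensorflow as tf

D_real = tf.placeholder("float", [None, 1])   # D(x), a probability in (0, 1)
D_fake = tf.placeholder("float", [None, 1])   # D(G(z))

eps = 1e-8    # keeps the logarithms away from log(0)
V = tf.reduce_mean(tf.log(D_real + eps) + tf.log(1.0 - D_fake + eps))
D_loss = -V                                          # the discriminator maximizes V
G_loss = tf.reduce_mean(tf.log(1.0 - D_fake + eps))  # the generator minimizes V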

G(z) has the same dimension as the real data. First we maximize V(D,G) by updating the parameters of the discriminator, and likewise minimize V(D,G) for the generator. It has been proven that solving the above equation produces fake data whose distribution equals that of the real data [1].

Proof)

V(G, D) = ∫_x p_data(x) log D(x) dx + ∫_z p_z(z) log(1 - D(g(z))) dz
        = ∫_x [ p_data(x) log D(x) + p_g(x) log(1 - D(x)) ] dx,

where p_g is chosen such that p_g(x) dx = p_z(z) dz. Maximizing over D,

max_D V(G, D) = ∫_x [ p_data(x) log D*_G(x) + p_g(x) log(1 - D*_G(x)) ] dx,
where D*_G(x) = p_data(x) / (p_data(x) + p_g(x))

(this is found by differentiating with respect to D). Therefore

max_D V(G, D) = E_{x~p_data(x)}[log D*_G(x)] + E_{x~p_g}[log(1 - D*_G(x))]
 = E_{x~p_data(x)}[ log( p_data(x) / (p_data(x) + p_g(x)) ) ] + E_{x~p_g}[ log( p_g(x) / (p_data(x) + p_g(x)) ) ]
 = -log(4) + KL( p_data || (p_data + p_g)/2 ) + KL( p_g || (p_data + p_g)/2 )
 = -log(4) + 2 JSD( p_data || p_g ),

where

JSD(p || q) = (1/2) KL(p || M) + (1/2) KL(q || M),   KL(p || q) = Σ_i p_i log(p_i / q_i),   M = (1/2)(p + q).

If the optimal state is reached, the distribution p_g of the generated data becomes equal to the distribution p_data of the real data: p_g = p_data.

However, practical GAN training differs from this theoretical process, so the fake data can differ widely from the real data. Therefore, before programming a GAN, let's take a look at some successful variations first.

1) Least-Squares GAN

A GAN uses cross-entropy for the loss function V(D,G), with a sigmoid function on the output of the discriminator. In this case, since the real data and the fake data are very easy to distinguish at the beginning of training, D(G(z)) is close to 0 and the gradients become very small. The Least-Squares GAN instead solves the following optimization problems [2]:

min_D V_LSGAN(D) = (1/2) E_{x~p_data(x)}[(D(x) - b)²] + (1/2) E_{z~p_z(z)}[(D(G(z)) - a)²],
min_G V_LSGAN(G) = (1/2) E_{x~p_data(x)}[(D(x) - c)²] + (1/2) E_{z~p_z(z)}[(D(G(z)) - c)²].

Like the GAN proof above, we can determine suitable coefficients a, b, and c (see Practice 1 below).
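The same kind of hedged sketch for the least-squares losses. The coefficient choice a = -1, b = 1, c = 0 is one option satisfying the conditions b - c = 1 and b - a = 2 derived in Practice 1 below; D_real and D_fake are again stand-in placeholders for the discriminator outputs (with no sigmoid on the output in this case):

import tensorflow as tf

D_real = tf.placeholder("float", [None, 1])   # D(x)
D_fake = tf.placeholder("float", [None, 1])   # D(G(z))

a, b, c = -1.0, 1.0, 0.0   # satisfies b - c = 1 and b - a = 2
D_loss = 0.5*tf.reduce_mean(tf.square(D_real - b)) + 0.5*tf.reduce_mean(tf.square(D_fake - a))
# The (D(x) - c)^2 term of V_LSGAN(G) does not depend on G, so only the second term is kept:
G_loss = 0.5*tf.reduce_mean(tf.square(D_fake - c))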

Practice 1) Find the optimal V_LSGAN for both the discriminator and the generator.

Solution sketch) For the discriminator, writing p_g(x) dx = p_z(z) dz,

V_LSGAN(D) = (1/2) ∫ [ p_d(x)(D(x) - b)² + p_g(x)(D(x) - a)² ] dx

is minimized by

D*(x) = ( b p_d(x) + a p_g(x) ) / ( p_d(x) + p_g(x) ).

Substituting D* into the generator objective

min_G V_LSGAN(G) = (1/2) E_{x~p_data(x)}[(D*(x) - c)²] + (1/2) E_{z~p_z(z)}[(D*(G(z)) - c)²],

for its minimization to force p_g = p_d, it suffices that the resulting integral takes the form

∫ ( (p_d(x) + p_g(x)) - 2 p_g(x) )² / ( p_d(x) + p_g(x) ) dx,

which is the Pearson χ² divergence between p_d + p_g and 2 p_g and vanishes only when p_g = p_d. Therefore b - c = 1 and b - a = 2.

2) Conditional GAN [3]

Consider a GAN successfully trained on the MNIST data set. The trained generator makes perfect hand-written digits, but we cannot choose which particular digit it generates. We can train the GAN with label information by conditioning the inputs of the discriminator and the generator. In the case of MNIST, the discriminator gets images of

digits together with their labels, and the generator gets latent variables together with labels for the images to be generated.

3) Deep Convolutional GAN

A Deep Convolutional GAN (DCGAN) [4], which has been trained successfully on many data sets (especially image data), has the following structure. Here, a strided convolution means a convolution with a stride of 2 or larger. A strided convolution shrinks the input into a smaller output, and a fractional-strided convolution expands the input into a larger output. For a generator built from a convolutional net, fractional-strided convolutions can be used to match the output size with that of the real data.

4) Fractional-strided convolution

Fractional-strided (transposed) convolution is available as a built-in function in TensorFlow:

tf.nn.conv2d_transpose()
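A hedged sketch of using it, upsampling a batch of 7x7 feature maps with 64 channels to 14x14 with 32 channels (all shapes here are illustrative choices of mine):

import tensorflow as tf

batch_size = 50
h = tf.placeholder("float", [batch_size, 7, 7, 64])

# For conv2d_transpose the filter shape is [height, width, output_channels, input_channels].
W = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
h_up = tf.nn.conv2d_transpose(h, W, output_shape=[batch_size, 14, 14, 32],
                              strides=[1, 2, 2, 1], padding='SAME')
print(h_up.get_shape())   # (50, 14, 14, 32)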

5) Batch normalization

When we train a model, one training iteration corresponds to one update of the parameters (weights). It is efficient to use a GPU to process multiple data at once, so the update values are averaged over the multiple data of a mini-batch. In this case it is better to normalize the values of the layers over the batch for stable training.

Training)
Input: values of x over a mini-batch B = {x_1, ..., x_m}; parameters to be learned: γ, β. Output: {y_i}.
  μ_B ← (1/m) Σ_{i=1}^m x_i                 // mini-batch mean
  σ_B² ← (1/m) Σ_{i=1}^m (x_i - μ_B)²       // mini-batch variance
  x̂_i ← (x_i - μ_B) / √(σ_B² + ε)           // normalize
  y_i ← γ x̂_i + β                           // scale and shift

Test)
  x̂ = (x - E[x]) / √(Var[x] + ε),   with E[x] ← E_B[μ_B],   Var[x] ← (m/(m-1)) E_B[σ_B²],   y = γ x̂ + β.

- ReLU
ReLU(x) = x if x > 0, and 0 otherwise.

- Leaky ReLU
LeakyReLU(x) = x if x > 0, and cx otherwise.

[1] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems, 2014.
[2] Mao, Xudong, et al. "Least squares generative adversarial networks." 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017.
[3] Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint arXiv:1411.1784 (2014).
[4] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

[Figure: structure of the DCGAN generator]

Project)
3. Using a Generative Adversarial Network, build a generator that produces the function x(t) = sin(ωt).


More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

arxiv: v1 [cs.lg] 20 Apr 2017

arxiv: v1 [cs.lg] 20 Apr 2017 Softmax GAN Min Lin Qihoo 360 Technology co. ltd Beijing, China, 0087 mavenlin@gmail.com arxiv:704.069v [cs.lg] 0 Apr 07 Abstract Softmax GAN is a novel variant of Generative Adversarial Network (GAN).

More information

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron

More information

TensorFlow: A Framework for Scalable Machine Learning

TensorFlow: A Framework for Scalable Machine Learning TensorFlow: A Framework for Scalable Machine Learning You probably Outline want to know... What is TensorFlow? Why did we create TensorFlow? How does Tensorflow Work? Example: Linear Regression Example:

More information

Math Assignment 5

Math Assignment 5 Math 2280 - Assignment 5 Dylan Zwick Fall 2013 Section 3.4-1, 5, 18, 21 Section 3.5-1, 11, 23, 28, 35, 47, 56 Section 3.6-1, 2, 9, 17, 24 1 Section 3.4 - Mechanical Vibrations 3.4.1 - Determine the period

More information

CSE 559A: Computer Vision

CSE 559A: Computer Vision CSE 559A: Computer Vision Fall 2017: T-R: 11:30-1pm @ Lopata 101 Instructor: Ayan Chakrabarti (ayan@wustl.edu). Staff: Abby Stylianou (abby@wustl.edu), Jarett Gross (jarett@wustl.edu) http://www.cse.wustl.edu/~ayan/courses/cse559a/

More information

GENERAL. CSE 559A: Computer Vision AUTOGRAD AUTOGRAD

GENERAL. CSE 559A: Computer Vision AUTOGRAD AUTOGRAD CSE 559A: Computer Vision Fall 2017: T-R: 11:30-1pm @ Lopata 101 GENERAL PSET 5 Posted. One day extension (now due on the Friday two weeks from now) You get to implement SLIC, and your own conv layer!

More information

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Welcome to the Machine Learning Practical Deep Neural Networks MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Introduction to MLP; Single Layer Networks (1) Steve Renals Machine Learning

More information

Introduction to (Convolutional) Neural Networks

Introduction to (Convolutional) Neural Networks Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Generative Adversarial Networks

Generative Adversarial Networks Generative Adversarial Networks SIBGRAPI 2017 Tutorial Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask Presentation content inspired by Ian Goodfellow s tutorial

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009 AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Introduction to TensorFlow

Introduction to TensorFlow Introduction to TensorFlow Oliver Dürr Datalab-Lunch Seminar Series Winterthur, 17 Nov, 2016 1 Abstract Introduc)on to TensorFlow TensorFlow is a mul/purpose open source so2ware library for numerical computa/on

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Image Processing in Numpy

Image Processing in Numpy Version: January 17, 2017 Computer Vision Laboratory, Linköping University 1 Introduction Image Processing in Numpy Exercises During this exercise, you will become familiar with image processing in Python.

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Do not tear exam apart!

Do not tear exam apart! 6.036: Final Exam: Fall 2017 Do not tear exam apart! This is a closed book exam. Calculators not permitted. Useful formulas on page 1. The problems are not necessarily in any order of di culty. Record

More information

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks Emily Denton 1, Soumith Chintala 2, Arthur Szlam 2, Rob Fergus 2 1 New York University 2 Facebook AI Research Denotes equal

More information

AI Programming CS F-20 Neural Networks

AI Programming CS F-20 Neural Networks AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Classification with Perceptrons. Reading:

Classification with Perceptrons. Reading: Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters

More information