CS 4700: Foundations of Artificial Intelligence Fall 2017 Instructor: Prof. Haym Hirsh Lecture 18
Prelim Grade Distribution
Homework 3: Out Today
Extra Credit Opportunity: 4:15pm Today, Gates G01 Relaxing Bottlenecks for Fast Machine Learning Christopher De Sa, Stanford University As machine learning applications become larger and more widely used, there is an increasing need for efficient systems solutions. The performance of essentially all machine learning applications is limited by bottlenecks with effects that cut across traditional layers in the software stack. Because of this, addressing these bottlenecks effectively requires a broad combination of work in theory, algorithms, systems, and hardware. To do this in a principled way, I propose a general approach called mindful relaxation. The approach starts by finding a way to eliminate a bottleneck by changing the algorithm's semantics. It proceeds by identifying structural conditions that let us prove guarantees that the altered algorithm will still work. Finally, it applies this structural knowledge to implement improvements to the performance and accuracy of entire systems. In this talk, I will describe the mindful relaxation approach, and demonstrate how it can be applied to a specific bottleneck (parallel overheads), problem (inference), and algorithm (asynchronous Gibbs sampling). I will demonstrate the effectiveness of this approach on a range of problems including CNNs, and finish with a discussion of my future work on methods for fast machine learning.
Today: First-Order Logic (R&N Ch. 8-9); Machine Learning (R&N Ch. 18). Tuesday: Machine Learning (R&N Ch. 18)
Resolution
Conversion to CNF maintains satisfiability: all steps guarantee equivalence except for Skolemization, which only maintains satisfiability.
Resolution is sound: if α ⊢ β then α ⊨ β.
Resolution is refutation complete: if α ⊨ β then α ∧ ¬β ⊢ {} (the empty clause).
Gödel's completeness theorem. (No generalization that encompasses arithmetic is complete: Gödel's incompleteness theorem.)
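For example (an illustrative refutation, not from the slides): to show {P, ¬P ∨ Q} ⊨ Q, add the negated goal ¬Q and resolve. Resolving (¬P ∨ Q) with (¬Q) yields (¬P); resolving (¬P) with (P) yields the empty clause {}. The clause set is unsatisfiable, so the entailment holds.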
Machine Learning
Learning
learn (dictionary.com):
1. to acquire knowledge of or skill in by study, instruction, or experience
2. to become informed of or acquainted with; ascertain: to learn the truth.
3. to memorize: He learned the poem so he could recite it at the dinner.
4. to gain (a habit, mannerism, etc.) by experience, exposure to example, or the like; acquire: She learned patience from her father.
5. (of a device or machine, especially a computer) to perform an analogue of human learning with artificial intelligence.
6. Nonstandard. to instruct in; teach.
Machine Learning An agent is learning if it improves its performance on future tasks after making observations about the world.
Supervised Learning: Given a training set of N example input-output pairs (x_1, y_1), (x_2, y_2), …, (x_N, y_N), where each y_i was generated by an unknown function y = f(x), discover a function h that approximates the true function f.
Supervised Learning: Given a training set of N example input-output pairs (x_1, y_1), (x_2, y_2), …, (x_N, y_N), where each y_i was generated by an unknown function y = f(x), discover a function h that approximates the true function f. Example: regression, where the range of f is the real numbers (each y_i is a number).
Supervised Learning: Given a training set of m example input-output pairs (x_1, y_1), (x_2, y_2), …, (x_m, y_m), where each y_i was generated by an unknown function y = f(x), discover a function h that approximates the true function f. Classification learning: the range of f is a finite set of values (each y_i is one of finitely many labels).
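As a minimal sketch (my own illustration, not from the slides; assumes numpy), the same unknown f can generate a regression problem or, after thresholding its outputs, a classification problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# The unknown target function f; the learner only ever sees (x_i, y_i) pairs.
f = lambda x: 1.7 * x - 4.9

x = rng.uniform(-5.0, 5.0, size=20)   # m = 20 training inputs
y = f(x)                              # outputs generated by y = f(x)

# Regression: the y_i are real numbers; fit a line h that approximates f.
slope, intercept = np.polyfit(x, y, deg=1)   # recovers about (1.7, -4.9)

# Classification: the outputs take finitely many values (here just 0 and 1).
labels = (y >= 0).astype(int)
```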
[Figure: linearly separable training data in the plane; the two classes are labeled + and − (equivalently 1 and 0, or 1 and −1).]
A separating line: x_2 = 1.7x_1 − 4.9
Equivalent forms of the same line: x_2 − 1.7x_1 = −4.9, 2x_2 − 3.4x_1 = −9.8, 10x_2 − 17x_1 = −49
Points above the line: x_2 ≥ 1.7x_1 − 4.9, equivalently x_2 − 1.7x_1 ≥ −4.9, 2x_2 − 3.4x_1 ≥ −9.8, 10x_2 − 17x_1 ≥ −49
f(x_1, x_2) = 1 if x_2 − 1.7x_1 ≥ −4.9, 0 otherwise
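A minimal sketch of this decision rule in Python (my own illustration; the test points are made up):

```python
def f(x1, x2):
    """1 for points on or above the line x2 = 1.7*x1 - 4.9, else 0."""
    return 1 if x2 - 1.7 * x1 >= -4.9 else 0

print(f(0, 0))  # 1: at x1 = 0 the line is at x2 = -4.9, and 0 > -4.9
print(f(5, 0))  # 0: at x1 = 5 the line is at x2 = 3.6, and 0 < 3.6
```

Multiplying both sides of the inequality by a positive constant (as on the previous slide) leaves every answer unchanged, which is why all the scaled forms define the same classifier.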
Formula for a line: w_1 x_1 + w_2 x_2 = b
Formula for a line: w_1 x_1 + w_2 x_2 = b. Points above the line: w_1 x_1 + w_2 x_2 ≥ b
f(x_1, x_2) = 1 if w_1 x_1 + w_2 x_2 ≥ b, 0 otherwise
Generalizing to n dimensions. Formula for a line (a "hyperplane"): w_1 x_1 + w_2 x_2 + … + w_n x_n = b, i.e., Σ_{i=1}^n w_i x_i = b, i.e., w · x = b.
Points above the hyperplane: w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ b, i.e., Σ_{i=1}^n w_i x_i ≥ b, i.e., w · x ≥ b.
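The same test in vector form, as a numpy sketch (my own illustration; the weights just encode the earlier line x_2 − 1.7x_1 = −4.9):

```python
import numpy as np

w = np.array([-1.7, 1.0])   # w_1 = -1.7, w_2 = 1.0
b = -4.9

x = np.array([0.0, 0.0])

# All three notations compute the same thing:
explicit = w[0] * x[0] + w[1] * x[1] >= b   # w_1 x_1 + w_2 x_2 >= b
summed   = np.sum(w * x) >= b               # sum_i w_i x_i >= b
dotted   = np.dot(w, x) >= b                # w . x >= b
print(explicit, summed, dotted)             # True True True
```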
Linear discriminant function: f(x_1, x_2, …, x_n) = 1 if Σ_{i=1}^n w_i x_i ≥ b, 0 otherwise
Linear discriminant function: f(x_1, x_2, …, x_n) = 1 if Σ_{i=1}^n w_i x_i ≥ b, 0 otherwise.
Goal of classification learning. Given: ((x_{1,1}, x_{1,2}, …, x_{1,n}), y_1), ((x_{2,1}, x_{2,2}, …, x_{2,n}), y_2), …, ((x_{m,1}, x_{m,2}, …, x_{m,n}), y_m) (the examples x_1, x_2, …, x_m with their labels). Find: (w_1, …, w_n) and b.
Notational trick: w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ b is equivalent to:
w_1 x_1 + w_2 x_2 + … + w_n x_n − b ≥ 0
−b + w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ 0
and, if we fix x_0 = 1 and set w_0 = −b:
w_0 x_0 + w_1 x_1 + w_2 x_2 + … + w_n x_n ≥ 0
Σ_{i=0}^n w_i x_i ≥ 0
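A sketch of the trick in numpy (my own illustration): prepend x_0 = 1 to every input and fold the threshold into w_0 = −b, so the test against b becomes a test against 0.

```python
import numpy as np

w = np.array([-1.7, 1.0])          # original weights
b = -4.9                           # original threshold

w_aug = np.concatenate(([-b], w))  # (w_0, w_1, ..., w_n) with w_0 = -b

def f(x):
    x_aug = np.concatenate(([1.0], x))        # x_0 = 1
    return 1 if np.dot(w_aug, x_aug) >= 0 else 0

print(f(np.array([0.0, 0.0])))  # 1, same answer as testing w . x >= b
```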
Linear discriminant function: f_w(x) = f(x_0, x_1, x_2, …, x_n) = 1 if Σ_{i=0}^n w_i x_i ≥ 0, 0 otherwise; the learned hypothesis is written h_w(x).
Goal of classification learning. Given: ((1, x_{1,1}, x_{1,2}, …, x_{1,n}), y_1), ((1, x_{2,1}, x_{2,2}, …, x_{2,n}), y_2), …, ((1, x_{m,1}, x_{m,2}, …, x_{m,n}), y_m) (the examples x_1, x_2, …, x_m with x_0 = 1 prepended). Find: (w_0, …, w_n).
Perceptrons
Neuron https://appliedgo.net/perceptron/
Perceptrons https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/
Perceptron Learning Rule
Current hypothesis: h_w(x)
w_0 = w_1 = w_2 = … = w_n = 0 [alternatively: set to random values]
Repeat
  For i = 1 to m [for each example]
    For j = 0 to n [for each weight, including the bias weight w_0]
      w_j ← w_j + α x_{i,j} (y_i − h_w(x_i))
Until h_w(x) gets all data correct [reorder data after each iteration]
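A minimal Python/numpy sketch of this rule (my own implementation, not the course's; the name perceptron_train is made up, and a max_epochs cap is added because the loop only terminates on its own when the data are linearly separable):

```python
import numpy as np

def perceptron_train(X, y, alpha=0.3, max_epochs=1000, seed=0):
    """Perceptron learning rule.

    X: (m, n+1) array of examples, each with x_0 = 1 already prepended.
    y: (m,) array of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)                      # w_0 = w_1 = ... = w_n = 0
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(m):     # reorder data after each iteration
            h = 1.0 if np.dot(w, X[i]) >= 0 else 0.0   # h_w(x_i)
            if h != y[i]:
                w += alpha * X[i] * (y[i] - h)   # w_j += alpha x_ij (y_i - h)
                mistakes += 1
        if mistakes == 0:                # h_w now gets all data correct
            break
    return w
```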
Perceptron Learning Rule: w_j ← w_j + α x_j (y_i − h_w(x_i))
If h_w(x_i) is correct, all w_j are unchanged: y_i = h_w(x_i), so (y_i − h_w(x_i)) = 0.
If h_w(x_i) is too big (predicts 1 when y_i = 0), w_j decreases (for x_j > 0); if it is too small (predicts 0 when y_i = 1), w_j increases (for x_j > 0).
α is the learning rate (sometimes called η).
Perceptron Learning Rule: Example
w_j ← w_j + α x_j (y_i − h_w(x_i))
Target: an AND gate. α = 0.3, w_0 = w_1 = w_2 = 0.
Training Data:
x_1  x_2  f(x_1, x_2)
0    0    0
0    1    0
1    0    0
1    1    1
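Running the earlier sketch on this data (reusing the hypothetical perceptron_train from above; the exact learned weights depend on the random ordering, but AND is linearly separable, so training converges to some separating w):

```python
X = np.array([[1, 0, 0],    # each row is (x_0 = 1, x_1, x_2)
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)   # AND of x_1 and x_2

w = perceptron_train(X, y, alpha=0.3)
preds = (X @ w >= 0).astype(float)
print(w, preds)   # preds should equal [0, 0, 0, 1]
```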
Perceptron Learning Rule: Example
w_j ← w_j + α x_j (y_i − h_w(x_i))
α = 0.3, w_0 = w_1 = w_2 = 0
Training Data:
x_1   x_2   f(x_1, x_2)
−1    0.5   1
−1    −1    1
1     0.5   0
0.5   1     0
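A hand trace of the first pass over these rows, in order (my own arithmetic, assuming the convention that h_w(x) = 1 when the weighted sum is exactly 0, and x_0 = 1 prepended to each example):
(−1, 0.5), y = 1: Σ w_i x_i = 0 ≥ 0, so h = 1 = y; no update.
(−1, −1), y = 1: Σ = 0, so h = 1 = y; no update.
(1, 0.5), y = 0: Σ = 0, so h = 1 ≠ y; w_0 ← 0 + 0.3·1·(0 − 1) = −0.3, w_1 ← 0 + 0.3·1·(0 − 1) = −0.3, w_2 ← 0 + 0.3·0.5·(0 − 1) = −0.15.
(0.5, 1), y = 0: Σ = −0.3 + (−0.3)(0.5) + (−0.15)(1) = −0.6 < 0, so h = 0 = y; no update.
Further passes continue until all four examples are classified correctly.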