Index

© Santanu Pattanayak 2017
S. Pattanayak, Pro Deep Learning with TensorFlow, https://doi.org/10.1007/978-1-4842-3096-1


A
Activation functions, neuron/perceptron
    binary threshold activation function, 102–103
    linear activation function, 102
    rectified linear unit, 106
    sigmoid activation function, 103–104
    SoftMax activation function, 104–105
    tanh activation function, 107
AdadeltaOptimizer, 133–134
AdagradOptimizer, 130–131
AdamOptimizer, 135
Auto encoders
    architecture, 323
    cases, 324
    combined classification network, class prediction, 326
    denoising auto-encoder implementation, 333
    element-wise activation function, 324
    hidden layer, 323
    KL divergence, 327–329
    learning rule of model, 324
    multiple hidden layers, 325
    network, class prediction, 326
    sparse, 328
    unsupervised ANN, 322

B
Backpropagation, 109
    convolution layer, 183–185
    for gradient computation
        cost derivative, 116
        cost function, 109–110, 112
        cross-entropy cost, SoftMax activation layer, 115
        forward pass and backward pass, 114
        hidden layer unit, 110
        independent sigmoid output units, 111
        multi-layer neural network, 113
        neural networks, 114
        partial derivative, 115–116
        partial derivative, cost function, 112–113
        propagating error, 109
        sigmoid activation functions, 114
        SoftMax function, 114
        SoftMax output layer, 114
    pooling layer, 186–187
Backpropagation through time (BPTT), 256
Batch normalization, 204–206
Bayesian inference
    Bernoulli distribution, 282
    likelihood function, 281–284, 286
    likelihood function plot, 284
    posterior distribution, 281
    posterior probability distribution, 281, 283, 285–286
    prior, 283
    prior probability distribution, 283, 285
Bayesian networks, 38
Bayes rule, 38
Bernoulli distribution, 48–49
Bidirectional RNN, 276–278
Binary threshold activation function, 102–103
Binomial distribution, 49
Block Gibbs sampling, 305
Boltzmann distribution, 279–280

C
Calculus, 23
    convex function, 30–31
    convex set, 29–30
    differentiation, 23–24
    gradient of function, 24–25
    Hessian matrix of function, 25
    local and global minima, 28–29
    maxima and minima of functions, 26
        for univariate function, 26–28
    multivariate convex and non-convex functions, 31–33
    non-convex function, 31
    positive semi-definite and definite, 29
    successive partial derivatives, 25
    Taylor series, 34
Central Limit theorem, 53
Collaborative filtering
    contrastive divergence, 315
    derived probabilities, 317
    description, 313
    energy configuration, 317
    joint configuration, 316
    matrix factorization method, 313
    probability of hidden unit, 316
    RBMs, 314
    restricted Boltzmann view, user, 314–315
Continuous bag of words (CBOW)
    hidden-layer embedding, 230
    hidden layer vector, 229, 231
    SoftMax output probability, 231
    TensorFlow implementation, 234
    word embeddings, 228–229
Contrastive divergence, 308–309, 315
Convolutional neural networks (CNNs), 153
    architectures, 206
        AlexNet, 208–209
        LeNet, 206–207
        ResNet, 210–211
        VGG16, 209–210
    components, 179
        convolution layer, 180–181
        input layer, 180
        pooling layer, 182
    convolution operation, 153
        2D convolution of image, 165–169
        2D convolution of signal, 163–165
        LTI/LSI systems, 153–155
        signals in one dimension, 155–156, 162–163
    digit recognition on MNIST dataset, 192–196
    dropout layers and regularization, 190–191
    elements, 153
    image-processing filters, 169
        Gaussian filter, 173
        gradient-based filters, 174–175
        identity transform, 177–178
        Mean filter, 169–171
        Median filter, 171–172
        Sobel edge-detection filter, 175–177
    for solving real-world problems, 196–203
    translational equivariance, 188–189
        pooling, 189–190
    weight sharing, 187
Cross-correlation, 180

D
Deep belief networks (DBNs)
    backpropagation, 318
    implementation, 319
    learning algorithm, 318
    MNIST dataset, 318
    RBMs, 317
    ReLU activation functions, 319
    schematic diagram, 317, 318
    sigmoid units, 318
Deep learning evolution
    artificial neural networks, 89–92
    artificial neuron structure, 90
    biological neuron structure, 89
    perceptron learning algorithms
        activation functions, hidden layers linear, 100–101
        backpropagation (see Backpropagation, for gradient computation)
        geometrical interpretation, 96–97
        hyperplane, classes, 93
        limitations, 97–98
        machine-learning domain, 94
        non-linearity, 99–100
        rule, multi-layer perceptrons network, 108–109
        weight parameters vector, 95
    vs. traditional methods, 116–117
Denoising auto-encoder, 333

E
Elliptical contours, 123, 125

F
Forget-gate value, 264
Fully convolutional network (FCN)
    architecture, 356
    down and up sampling
        max unpooling, 360
        transpose convolution, 361, 363
        unpooling, 359
    output feature maps, network, 357–358
    pixel categories, 356
    SoftMax probability, 357

G, H
Gated recurrent unit (GRU), 274–276
Gaussian blur, 173
Generative adversarial networks (GANs)
    agents zero-sum game, 378
    cost function and training, 383–385
    generative models, 378
    illustration, 379
    maximin and minimax problem, 379–380
    minimax and saddle points, 382–383
    neural networks, 378
    TensorFlow implementation, 386
    vanishing gradient, generator, 386
    zero sum game, 381
Gibbs sampling
    bivariate normal distribution, 305
    block, 305
    burn-in period, 306
    conditional distributions, 305
    generating samples, 306
    Markov Chain Monte Carlo method, 304
    restricted Boltzmann machines, 306–307
Global co-occurrence methods, 241
    building word vectors, 243–244
    extraction, word embeddings, 242
    statistics and prediction methods, 240
    SVD method, 241
    word combination, 241
    word-embeddings plot, 245
    word-vector embedding matrix, 242
Global minima, 28
GloVe, 245
Gradient clipping, 261
Gradient descent, backpropagation, 236
GradientDescentOptimizer, 130
Graphical processing unit (GPU), 152

I, J
Image classification, 373–374
Image segmentation, 345
    binary thresholding method, histogram, 345, 349
    FCN (see Fully convolutional network (FCN))
    K-means clustering, 352
    Otsu's method, 346–349
    semantic segmentation, 355
    sliding window approach, 355
    in TensorFlow implementation, semantic segmentation, 365
    U-Net convolutional neural network, 364–365
    Watershed algorithm, 349–352

K
Karush–Kuhn–Tucker method, 78
K-means algorithm, 352
Kullback-Leibler (KL) divergence
    plot for mean, 327
    sparse auto-encoders, 328–329

L
Lagrangian multipliers, 79
Language modeling, 254–255
Lasso regularization, 16
Linear activation function, 102
Linear algebra, 2
    determinant of matrix, 12
        interpretation, 13
    Eigen vectors, 18–19
        characteristic equation of matrix, 19–22
        power iteration method, 22–23
    identity matrix or operator, 11–12
    inverse of matrix, 14
    linear independence of vectors, 9–10
    matrix, 4–5
    matrix operations and manipulations, 5
        addition of two matrices, 6
        matrix working on vector, 8
        product of two matrices, 6
        product of two vectors, 7
        subtraction of two matrices, 6
        transpose of matrix, 7
    norm of vector, 15–16
    product of vector in direction of another vector, 17–18
    pseudo inverse of matrix, 16
    rank of matrix, 10–11
    scalar, 4
    tensor, 5
    unit vector in direction of specific vector, 17
    vector, 3–4
Linear shift invariant (LSI) systems, 153–155
Linear time invariant (LTI) systems, 153–155
Localization network, 373–374
Local minima point, 28
Long short-term memory (LSTM)
    architecture, 262
    building blocks and function, 262–263
    exploding- and vanishing-gradient problems, 263–264
    forget gate, 263
    output gates, 263

M, N
Machine learning, 55
    constrained optimization problem, 77–79
    and data science, 2
    dimensionality reduction methods, 79
        principal component analysis, 80–83
        singular value decomposition, 83–84
    optimization techniques
        contour plot and lines, 68–70
        gradient descent, 66
        linear curve, 74
        for multivariate cost function, gradient descent, 67–68
        negative curvature, 75
        Newton's method, 74
        positive curvature, 76–77
        steepest descent, 70
        stochastic gradient descent, 71–73
    regularization, 84–86
        constraint optimization problem, 86–87
    supervised learning, 56
        classification, 61–64
        hyperplanes and linear classifiers, 64–65
        linear regression, 56–61
    unsupervised learning, 65
Markov Chain, 288
Markov Chain Monte Carlo (MCMC) methods, 280
    aperiodicity, 289
    area of Pi, 287
    computation of Pi, 287
    detailed balance condition, 289
    implementation, 289
    irreducibility, 289
    Metropolis algorithm
        acceptance probability, 291
        bivariate Gaussian distribution, sampling, 291–293
        heuristics, 290
        implementation, 290
        transition probability function, 290, 291
    probability zones, 287
    sampling, 286
    states, gas molecules, 288
    stochastic/random, 288
    transition probability, 288
Matrix factorization method, 313
Maximum likelihood estimate (MLE) technique, 52–53
Max unpooling, 360
Momentum-based optimizers, 136–137
Monte Carlo method, 287
Multi-layer Perceptron (MLP), 99

O
Object detection
    fast R-CNN network, 377
    R-CNN network, 376–377
    sliding-window technique, 375
    task, 375
Otsu's method, 346–349
Overfitting, 84

P, Q
PCA and ZCA whitening
    advantage, 340–341
    illustration, 340–342
    pixels, 340
    spatial structure, 341
    techniques, 340
    whitening transform, 341
Perceptron, 92
Points of inflection, 26
Principal component analysis, 279. See also PCA and ZCA whitening
Probability, 34
    Bayes rule, 38
    chain rule, 37
    conditional independence of events, 38
    correlation coefficient, 44
    covariance, 44
    distribution
        Bernoulli distribution, 48–49
        binomial distribution, 49
        multivariate normal distribution, 48
        normal distribution, 46–47
        Poisson distribution, 50
        uniform distribution, 45–46
    expectation of random variable, 39
    hypothesis testing and p-value, 53–55
    independence of events, 37
    likelihood function, 51
    MLE, 52–53
    mutually exclusive events, 37
    probability density function (pdf), 39
    probability mass function (pmf), 38
    skewness and kurtosis, 40, 42
    unions, intersection, and conditional, 35–37
    variance of random variable, 39–40

R
Rectified linear unit (ReLU) activation function, 106
Recurrent neural networks (RNNs)
    architectural principle, 252
    bidirectional RNN, 276–278
    BPTT, 256
    component, 253–254
    embeddings layer, 252
    folded and unfolded structure, 252
    GRU, 274–276
    language modeling, 254–255
    LSTM, 262–263
    MNIST digit identification, TensorFlow
        Alice in Wonderland, 273
        implementation, LSTM, 266
        input tensor shape, LSTM network, 265
        next-word prediction and sentence completion, 268
    traditional language models, 255
    vanishing and exploding gradient problem
        gradient clipping, 261
        LSTMs, 263–264
        memory-to-memory weight connection matrix and ReLU units, 261
        sigmoid function, 259
        temporal components, 259
Restricted Boltzmann machines (RBMs)
    Block Gibbs sampling, 305
    collaborative filtering
        binary visible unit, 315
        contrastive divergence, 315
        hidden units, 314–315, 317
        joint configuration, 316
        Netflix Challenge, 314
        probability of hidden unit, 316
        schematic diagram, matrix factorization method, 313
        SoftMax function, 315
        three-way energy configuration, 317
    conditional probability distribution, 296
    contrastive divergence, 308–309
    DBNs (see Deep belief networks (DBNs))
    deep networks, 294
    discrete variables, 297
    Gibbs sampling, 304–308
    graphical probabilistic model, 295
    implementation, MNIST dataset, 309
    joint configuration, 295
    joint probability distribution, 295, 298
    machine learning algorithms, 294
    partition function Z, 295
    sigmoid function, 299
    symmetrical undirected network, 299
    training, 299
    visible and hidden layers architecture, 294
Ridge regression, 86
Ridge regularization, 16
RMSprop, 131–132

S
Saddle points, 127, 129, 382–383
Semantic segmentation, 355
    in TensorFlow, FCN network, 365
Sigmoid activation function, 103–104
Singular value decomposition (SVD), 240–241, 313, 340
Skip-gram models, 236
    TensorFlow implementation, 240
    word embedding, 235–237
Sliding window approach, 355
SoftMax activation function, 104–105
Sparse auto-encoders
    hidden layer output, 329
    hidden layer sigmoid activations, 328
    hidden structures, input data, 328
    implementation, TensorFlow, 329
Stochastic gradient descent (SGD), 71, 127
Supremum norm, 15

T
Tanh activation function, 107
Taylor series expansion, 34
TensorFlow
    commands, define
        check Tensor shape, 120
        explicit evaluation, 120
        InteractiveSession() command, 119–121
        invoke session and display, variable, 121
        Numpy Array to Tensor conversion, 122
        placeholders and feed dictionary, 122
        TensorFlow and Numpy Library, 119
        TensorFlow constants, 120
        TensorFlow variable, random initial values, 121
        tf.Session(), 121
        variables, 121
        variable state update, 122
    deep-learning packages, 118
    features, deep-learning frameworks, 118–119
    gradient-descent optimization methods
        elliptical contours, 123, 125
        non-convexity of cost functions, 126
        saddle points, 127, 129
    installation, 119
    linear regression
        actual house price vs. predicted house price, 146
        cost plot over epochs, 145
        implementation, 143
    meta graph definition, 390
    mini-batch stochastic gradient descent, rate, 129
    models deployment, production, 389–392
    multi-class classification, SoftMax function
        full-batch gradient descent, 146
        stochastic gradient descent, 149
    optimizers
        AdadeltaOptimizer, 133–134
        AdagradOptimizer, 130–131
        AdamOptimizer, 135
        batch size, 138
        epochs, 138
        GradientDescentOptimizer, 130
        MomentumOptimizer and Nesterov Algorithm, 136–137
        number of batches, 138
        RMSprop, 131–132
    XOR implementation
        computation graph, 140–141
        hidden layers, 138
        linear activation functions, hidden layer, 142
Traditional language models, 255
Transfer learning, 211
    with Google InceptionV3, 213–214, 216
    guidelines, 212
    with pre-trained VGG16, 216–219, 221
Transpose convolution, 361, 363

U
U-Net architecture, 364
Unpooling, 359

V
Vector representation of words, 227
Vector space model (VSM), 227

W, X, Y
Watershed algorithm, 349–352
Word-embeddings plot, 245
Word-embedding vector, 228–230
Word2Vec
    CBOW method (see Continuous bag of words (CBOW))
    global co-occurrence methods, 240
    GloVe, 245
    skip-gram models, 235–237
    TensorFlow implementation, CBOW, 231
    word analogy, word vectors, 249
Word-vector embeddings matrix, 242

Z
Zero sum game, 381