arxiv: v1 [cs.cv] 22 Aug 2017

On Image Classification: Correlation v.s. Causality

ZHEYAN SHEN, Tsinghua University
PENG CUI, Tsinghua University
KUN KUANG, Tsinghua University
BO LI, Tsinghua University

Image classification is one of the fundamental problems in computer vision. Owing to the availability of large image datasets like ImageNet and YFCC100M, a plethora of research has been conducted on high-precision image classification and many remarkable achievements have been made. The success of most existing methods hinges on a basic hypothesis that the testing image set has the same distribution as the training image set (i.e. the i.i.d. hypothesis). However, in many real applications, we cannot guarantee the validity of the i.i.d. hypothesis since the testing image set is unseen. It is thus desirable to learn an image classifier that performs well even in non-i.i.d. situations. In this paper, we propose a novel Causally Regularized Logistic Regression (CRLR) algorithm that addresses the non-i.i.d. problem, without any information about the testing data, by searching for causal features. The causal features are the characteristics that truly determine whether a specific object belongs to a category or not. Identifying causal features allows us to construct classifiers that adapt to distributional changes in non-i.i.d. circumstances even when the testing set is unseen. Algorithmically, we propose a causal regularizer for causal feature identification and jointly optimize it with a logistic loss term. With the causal regularizer, we can estimate the causal contribution (causal effect) of each focal image feature (viewed as a treatment variable) by sample reweighting, which ensures that the distributions of all remaining image features are close between images with different focal feature levels. The resultant classifier is based on the estimated causal contributions of the features, rather than on traditional correlation-based contributions.

To validate the effectiveness of our CRLR algorithm, we manually construct a new image dataset from YFCC100M (the Yahoo Flickr Creative Commons WebScope dataset), simulating various non-i.i.d. situations in the real world, and conduct extensive experiments for image classification. Experimental results clearly demonstrate that our CRLR algorithm outperforms the state-of-the-art methods. We further visualize the top causal features selected by our algorithm on our image dataset.

CCS Concepts: Computing methodologies → Object recognition; Regularization

Additional Key Words and Phrases: Image Classification, Causal Inference, Non-i.i.d. Situations, Causally Regularized Logistic Regression

ACM Reference format: Zheyan Shen, Peng Cui, Kun Kuang, and Bo Li. On Image Classification: Correlation v.s. Causality. 1, 1, Article 1 (August 2017), 16 pages.

[Figure 1. Panels: training phase (training data, category-level causal feature distribution, category-level correlation feature distribution, feature visualization) and testing phase (sample-level correlation and causal feature distributions for the i.i.d. case and the non-i.i.d. case).]

Fig. 1. Illustration of the difference between correlation-based and causality-based classification methods in addressing non-i.i.d. cases.

1 INTRODUCTION

Image classification has advanced rapidly in recent years. The exponentially increased scale of image data and the significantly improved model capacity ensure that the correlation between image pixels and image categories can be maximally leveraged. The success of these correlation-based image classification methods depends on a basic hypothesis that the testing image set has the same distribution as the training image set (i.e. the i.i.d. hypothesis). In this case, the correlations between image pixels and categories in the training set provide sufficient predictive power for unseen testing images. However, a long-standing but largely ignored question is: what if the i.i.d. hypothesis no longer holds? In real applications, we cannot guarantee the validity of the i.i.d. hypothesis since the testing images are unseen. Typical non-i.i.d. situations include context bias, label composition bias, or both. Context bias refers to situations where the training data and testing data have different distributions over visual contexts (e.g. for a dog category, an environment like grass is a visual context). Label composition bias is induced by different percentages of positive and negative labels in the training and testing sets for some categories. These non-i.i.d. situations pose great challenges to correlation-based methods, because the correlation patterns in the training set may differ from those in the testing set, and hence cannot be exploited for prediction on the testing set.

An example of a non-i.i.d. situation induced by context bias is illustrated in Figure 1. The classifier for dogs is trained on images that mostly show dogs on grass, and is tested on one image with a dog in a grass context (i.e. the i.i.d. case) and another image with a dog in a snow context (i.e. the non-i.i.d. case). The correlation-based method can succeed in the i.i.d. example, but fails in the non-i.i.d. example. The failure occurs mainly because the grass features are assigned high weights in the classifier, since they are highly correlated with the label in the training set, but they do not appear in the testing image.

Recently, some works have addressed the non-i.i.d. problem through covariate shift, with the objective of minimizing the predictive loss under the testing data distribution by reweighting the training samples [Huang et al. 2007; Liu and Ziebart 2014; Shimodaira 2000; Sugiyama et al. 2007; Zadrozny 2004]. But these works need prior knowledge of the testing data distribution, which is unavailable in many non-batch applications. In this work, we aim to simulate various non-i.i.d. settings and address them without knowing the testing data distribution, which is distinct from the above work and more reasonable in real applications.

We address the non-i.i.d. problem by discriminating the causal features, which truly determine whether a specific object belongs to a category, from the correlation features. As shown in Figure 1, if the classifier is based on the causal features of the dog category, such as the fur, nose and ear features, it will be insensitive to context bias and thus avoid errors in non-i.i.d. situations. Causal inference is a powerful statistical tool for discovering the causal relationship between two variables. The gold standard for causal analysis is to conduct controlled experiments like A/B testing. But controlled experiments are infeasible in scenarios like image classification, where micro-level features cannot be manipulated. Causal analysis methods based on observational data include propensity score matching or reweighting [Austin 2011; Bang and Robins 2005; Kuang et al. 2017], Markov blankets [Aliferis et al. 2003; Koller and Sahami 1996; Pellet and Elisseeff 2008; Yu and Liu 2004] and confounder balancing [Athey et al. 2016; Hainmueller 2011; Zubizarreta 2015]. The key problem that these methods try to solve is removing the confounding bias induced by the different confounder distributions of the treated and control units.

To marry causal analysis with image classification, we successively regard each image feature (e.g. a visual word) as a treatment variable, and all other image features as confounders. A straightforward approach is to first select causal features and then learn a classifier based on them. However, this approach is statistically sensitive to the threshold used for causal feature selection, and the step-by-step method is difficult to optimize in practice. How to develop a unified approach for joint causal analysis and image classification therefore remains a challenging problem. Moreover, existing causal analysis methods are proposed for well-designed settings where typically a small number of treatment variables are considered. In image classification, by contrast, we have little prior knowledge of the causal relationships, and thus have to regard all features as treatment variables. This leads to a large-scale causal effect learning problem involving a large number of potential treatment variables, to which existing methods are not directly applicable. Hence, it is highly non-trivial to design a scalable causal learning method adapted to the image classification setting.

In this paper, we propose a novel Causally Regularized Logistic Regression (CRLR) model to address the non-i.i.d. problem in image classification. The model consists of a logistic loss term and a causal regularizer. The latter makes the model prone to exploit causal features for prediction. Specifically, the causal regularizer aims to directly balance the confounder distributions for each treatment feature by reweighting the samples. In order to reduce model complexity, we propose a global sample reweighting method which learns a common sample weight vector that maximally balances confounders for all treatment features. The sample weights are also taken into account when calculating the logistic loss, and, in this way, the logistic loss and the causal regularizer are jointly optimized.
The notion behind the method is to resample the images so that the training images of a given category are uniformly distributed across different contexts. For example, in a dog category, the method will enforce the total weight of images in different contexts like grass, beach and car to be almost the same. In this case, minimizing the logistic loss will lead to higher weights on the dog features rather than the context features.

In order to evaluate the performance of a classification method in non-i.i.d. situations, we construct a new image dataset from YFCC100M. This dataset includes 10 categories, and the images in a category are divided into 5 contexts. For example, in the dog category, the 5 contexts are grass, beach, car, sea, and snow. With this dataset, we can easily simulate various non-i.i.d. situations by tuning the training and testing distributions.

The technical contributions of this paper are four-fold:
• We investigate a new problem of image classification in non-i.i.d. situations induced by unseen testing data, which is long-standing but largely ignored in the literature.
• We propose a novel Causally Regularized Logistic Regression model to address this problem, where the causal analysis and image classification are jointly optimized in an effective and efficient way.
• We conduct extensive experiments, and the results demonstrate the inability of existing methods to deal with non-i.i.d. situations, and the superiority of our method in such scenarios. The explainability of our method is also a notable merit.
• We have shared the new dataset we constructed for evaluating image classification in non-i.i.d. situations to promote future research in this direction.

The remaining sections are organized as follows. Section 2 reviews the related work. Section 3 describes the problem formulation and our CRLR algorithm. Section 4 presents the optimization of the CRLR algorithm and analyzes its complexity. Section 5 gives the experimental results. Finally, Section 6 concludes the paper.

2 RELATED WORK

Previous related work can be categorized into non-i.i.d. learning methods and causal inference methods, which we briefly review and discuss in this section.

The non-i.i.d. learning problem, particularly the covariate shift problem, has been investigated in the machine learning field. The main idea of these methods is to shift the training set distribution so that it matches the testing set distribution. Zadrozny [Zadrozny 2004] used rejection sampling to correct the training set distribution into that of the testing set. Huang et al. [Huang et al. 2007] presented a nonparametric method to match the distributions of the training and testing sets in the feature space. Sugiyama et al.
[Sugiyama et al. 2007] designed an importance weighted cross validation for model selection under non-i.i.d. situations. Liu et al. [Liu and Ziebart 2014] proposed a robust bias-aware probabilistic classifier by reweighting the training distribution via a minimax estimation formulation. Most of these methods require prior knowledge of the testing set distribution, which is unavailable in many real applications including non-batch image classification. In this work, we address the non-i.i.d. problem using only the training set, which is more reasonable in real scenarios.

Causal inference is a powerful statistical modeling tool for explanatory analysis. The major challenge in estimating a causal effect is to balance the distributions of confounders across different treatment levels. Rosenbaum and Rubin [Rosenbaum and Rubin 1983] proposed to achieve this balance by propensity score matching or reweighting. Methods based on propensity scores have been widely used in various fields, including economics [Stuart 2010], epidemiology [Funk et al. 2011], health care [Dos Reis and Culotta 2015], social science [Lechner 1999] and advertising [Sun et al. 2015]. But these methods can only handle one or a few treatment variables and cannot be directly applied to image classification, in which a huge number of features are viewed as potential treatment variables.

A growing recent literature proposes to directly optimize sample weights to balance confounder distributions. Hainmueller [Hainmueller 2011] introduced entropy balancing to directly adjust sample weights to match specified sample moments. Athey et al. [Athey et al. 2016] proposed approximate residual balancing, which learns sample weights via a lasso residual regression adjustment. Zubizarreta [Zubizarreta 2015] learnt stable balancing weights by minimizing their variance and adjusting them for confounder balancing. These methods provide an effective way to estimate causal effects without prior structural knowledge, but they reweight samples targeting a single treatment variable. We adapt the reweighting balance technique to the large-scale causal effect exploration setting that we target.

Recently, Lopez-Paz et al. [Lopez-Paz et al. 2017] combined a neural network with a causal framework to identify causal signals in visual images. Their work has the potential to be extended to causality-based classification. But it learns the causal features one by one, and errors may accumulate quickly given the huge number of features in images.

3 CAUSAL CLASSIFICATION

In this section, we first provide the formulation of the causality-based image classification problem (called causal classification hereafter for brevity). After that, we present some preliminaries on confounder balancing to make the paper self-contained, followed by a detailed introduction to our proposed Causally Regularized Logistic Regression (CRLR) method.

3.1 Problem Formulation

As stated in the introduction, the problem we focus on in this paper is the non-i.i.d. problem in image classification, which can be formulated as follows:

Problem 1 (Non-i.i.d. problem in image classification). Given the training image set $D_{train} = (X_{train}, Y_{train})$, where $X_{train} \in \mathbb{R}^{n \times p}$ represents the image features and $Y_{train} \in \mathbb{R}^{n \times 1}$ represents the image labels, the task is to learn an image classifier $f_\theta(\cdot)$ with parameter $\theta$ to precisely predict the labels of the testing image set $D_{test} = (X_{test}, Y_{test})$, where $D_{test}$ is unseen and its distribution $\Psi(D_{test}) \neq \Psi(D_{train})$.

The non-i.i.d. problem means that the training and testing image sets follow different distributions, namely $\Psi(D_{test}) \neq \Psi(D_{train})$. This includes different distributions of image features, $\Psi(X_{test}) \neq \Psi(X_{train})$, of image labels, $\Psi(Y_{test}) \neq \Psi(Y_{train})$, or both. To address the non-i.i.d. problem, we adapt causal inference to analyze the causal contribution of each image feature to the image label and identify the causal features that truly determine whether an image belongs to a category or not.

By adapting causal inference to image classification, we can regard each image feature $X_j$ as a treated variable (i.e. treatment), all the remaining features $X_{-j} = X \setminus X_j$ as confounding variables (i.e. confounders), and the image category $Y$ as the outcome variable. Given a feature and a label, when the feature occurs (or does not occur) in an image, the image becomes a treated (or control) image. To safely estimate the causal contribution of a given image feature $X_j$ to the image label $Y$, one has to remove the confounding bias induced by the different distributions of the confounders $X_{-j}$ between the treated and control image sets. After removing the confounding bias, the difference in the image label $Y$ between the treated and control image sets can be seen as the causal contribution of feature $X_j$ to the image category $Y$.

With causal analysis on image classification, we can identify the causal contribution $\beta \in \mathbb{R}^{p \times 1}$ of all image features, which is robust and insensitive to the distributional changes of the unseen testing image set. Namely, the causal contributions of all image features satisfy $\beta_{train} \approx \beta_{test}$, even though the testing set is unseen and $\Psi(D_{test}) \neq \Psi(D_{train})$. We address the non-i.i.d. problem in image classification through the following causal classification problem.

Problem 2 (Causal Classification Problem). Given the training image set $D = (X, Y)$, where $X \in \mathbb{R}^{n \times p}$ represents the image features and $Y \in \mathbb{R}^{n \times 1}$ represents the image category, the task is to identify the causal contribution $\beta$ of all image features and jointly learn an image classifier $f_\beta(\cdot)$ based on $\beta$ for image classification.
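The treated/control view described above can be illustrated with a minimal sketch. It assumes binary features, so that the indicator of feature j is simply whether column j is non-zero; the function name `treated_control_gap`, the two-feature toy data ("fur" and "grass") and the hand-picked weights are illustrative assumptions, not part of the paper.

```python
import numpy as np

def treated_control_gap(X, Y, j, W=None):
    """Weighted difference in mean label between images with and without feature j."""
    X = np.asarray(X)
    Y = np.asarray(Y, dtype=float)
    W = np.ones(len(Y)) if W is None else np.asarray(W, dtype=float)
    treated = X[:, j] > 0          # images in which feature j occurs
    control = ~treated             # images in which it does not
    mean_treated = np.sum(W[treated] * Y[treated]) / np.sum(W[treated])
    mean_control = np.sum(W[control] * Y[control]) / np.sum(W[control])
    return mean_treated - mean_control

# Toy data (hypothetical): column 0 = "fur", which determines the label,
# column 1 = "grass", a context feature that merely co-occurs with fur.
X = np.array([[1, 1], [1, 1], [1, 1], [1, 0],
              [0, 1], [0, 0], [0, 0], [0, 0]])
Y = X[:, 0]

# Unweighted gap for "grass": confounded by the unbalanced "fur" feature.
naive_gap = treated_control_gap(X, Y, j=1)

# Hand-picked weights that equalize the weighted share of "fur" images
# between the grass and no-grass groups (0.5 in both).
W_balanced = np.array([1, 1, 1, 3, 3, 1, 1, 1], dtype=float)
balanced_gap = treated_control_gap(X, Y, j=1, W=W_balanced)

print(naive_gap, balanced_gap)  # 0.5 0.0
```

In this toy case the unweighted gap credits the grass feature with a contribution of 0.5, while the gap vanishes once the confounder is balanced across the two groups; the fur feature keeps a gap of 1 under both weightings.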

Table 1. Symbols and definitions.

Symbol                                Definition
$n$                                   Number of images
$p$                                   Dimension of image features
$X \in \mathbb{R}^{n \times p}$       Image features
$I \in \mathbb{R}^{n \times p}$       Indicator for treated and control
$Y \in \mathbb{R}^{n \times 1}$       Image category (outcome)
$W \in \mathbb{R}^{n \times 1}$       Sample weight
$\beta \in \mathbb{R}^{p \times 1}$   Causal contribution of feature

The key challenge in the causal classification problem is how to jointly optimize causal contribution identification and image classification. In this paper, we propose a synergistic learning algorithm composed of a causal regularizer and a logistic loss term. The causal regularizer balances the confounder distributions associated with each treatment feature by sample reweighting, and the causal contributions of all image features are estimated by logistic regression on the reweighted sample. The resulting logistic regression model gives a causal classification rule. Our proposed synergistic optimization program is a one-step method and can be tuned easily.

3.2 Confounder Balancing

In observational studies, the confounder distributions need to be balanced to correct for the bias arising from non-random treatment assignment. Confounder balancing approaches exploit moments to characterize distributions, as moments can uniquely determine a distribution. Instead of balancing confounder distributions, they directly balance confounder moments by adjusting the weights of samples. The sample weights $W$ are learned by:

$$W = \arg\min_{W} \Big\| \bar{X}_t - \sum_{j: T_j = 0} W_j X_j \Big\|_2^2. \qquad (1)$$

Given a treatment feature $T$, $\bar{X}_t$ and $\sum_{j: T_j = 0} W_j X_j$ represent the mean value of the confounders on samples with and without treatment, respectively. Note that only the first-order moment is considered in Eq. 1; higher-order moments can be easily incorporated by including more features. We will adapt the sample reweighting technique to simultaneously balance the confounder distributions associated with all treatment features.
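As a rough illustration of Eq. 1, the sketch below learns weights on the control units so that their weighted confounder mean matches the treated mean. It is a simplification under stated assumptions: the weights are only constrained to sum to one (via an appended row of ones), non-negativity is not enforced, and the problem is solved as a plain least-squares system with `numpy.linalg.lstsq`. The function name `balancing_weights` and the synthetic data are hypothetical.

```python
import numpy as np

def balancing_weights(X_conf, T):
    """Weights on control units whose weighted confounder mean matches the treated mean.

    Sketch of Eq. (1) with first-order moments only. Assumption: weights sum to
    one but may be negative (no non-negativity constraint is imposed here).
    """
    X_conf = np.asarray(X_conf, dtype=float)
    T = np.asarray(T, dtype=bool)
    x_bar_t = X_conf[T].mean(axis=0)            # mean confounders of treated units
    X_c = X_conf[~T]                            # control units, shape (m, p)
    m = X_c.shape[0]
    # Stack the moment equations X_c^T W = x_bar_t with the constraint sum(W) = 1.
    A = np.vstack([X_c.T, np.ones(m)])
    b = np.append(x_bar_t, 1.0)
    W, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimum-norm least-squares solution
    return W, x_bar_t, X_c

# Synthetic example: treated units are shifted by +1 in every confounder.
rng = np.random.default_rng(0)
n, p = 40, 3
T = np.arange(n) % 2 == 0
X_conf = rng.normal(size=(n, p)) + T[:, None]

W, x_bar_t, X_c = balancing_weights(X_conf, T)
gap_uniform = np.linalg.norm(x_bar_t - X_c.mean(axis=0))   # imbalance with equal weights
gap_balanced = np.linalg.norm(x_bar_t - X_c.T @ W)         # imbalance after reweighting
print(gap_uniform, gap_balanced)
```

With more control units than moment constraints, the system is underdetermined and the imbalance can be driven to numerical zero; practical balancing methods add further constraints (such as non-negative, low-variance weights) to pick a well-behaved solution.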
We will elaborate on this in Section 3.3.

3.3 Causally Regularized Logistic Regression

Inspired by the confounder balancing method, we propose a causal regularizer that reweights training images by successively setting each image feature as the treatment variable and all remaining features as confounders. Our proposed causal regularizer is:

$$\sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2, \qquad (2)$$

where $W$ is the vector of sample weights. The term $\left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2$ represents the loss of confounder balancing when setting image feature $j$ as the treatment variable, and $X_{-j}$ denotes all the remaining features (i.e. confounders), obtained from $X$ by replacing its $j$-th column with 0. $I_j$ is the $j$-th column of $I$, and $I_{ij}$ refers to the treatment status of unit $i$ when setting feature $j$ as the treatment variable. The notion behind this regularizer is that when we estimate the causal effect of a treatment variable, we are trying to keep the distributions of the other variables (confounders) as consistent as possible across the different levels of the treatment variable.

By incorporating the causal regularizer in Eq. 2, we obtain our Causally Regularized Logistic Regression (CRLR) algorithm, which jointly optimizes the sample weights $W$ and the causal contribution $\beta$ for causal classification based on logistic regression:

$$\min \sum_{i=1}^{n} W_i \cdot \log\big(1 + \exp((1 - 2Y_i) \cdot (x_i \beta))\big), \qquad (3)$$

$$\text{s.t.} \quad \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2 \le \lambda_1, \quad W \succeq 0,$$

$$\|W\|_2^2 \le \lambda_2, \quad \|\beta\|_2^2 \le \lambda_3, \quad \|\beta\|_1 \le \lambda_4, \quad \Big(\sum_{k=1}^{n} W_k - 1\Big)^2 \le \lambda_5,$$

where $\sum_{i=1}^{n} W_i \cdot \log(1 + \exp((1 - 2Y_i) \cdot (x_i \beta)))$ represents the loss of logistic regression after sample reweighting, and $x_i$, the $i$-th row of $X$, represents the features of unit $i$. The elastic net constraints $\|\beta\|_2^2 \le \lambda_3$ and $\|\beta\|_1 \le \lambda_4$ help avoid overfitting. The constraint $W \succeq 0$ forces each sample weight to be non-negative. With the norm constraint $\|W\|_2^2 \le \lambda_2$, we reduce the variance of the sample weights to achieve stability. The constraint $(\sum_{k=1}^{n} W_k - 1)^2 \le \lambda_5$ prevents all the sample weights from being zero.

In the traditional logistic regression model, the coefficients capture the correlation between the features and the category label. However, high correlation does not imply causation, due to confounding bias. The sample weights produced by the causal regularizer are capable of correcting this bias. The estimated coefficients from CRLR can thus be viewed as the causal contributions of the features, which can also be ranked after feature standardization.

4 OPTIMIZATION

In this section, we give the optimization details of our CRLR algorithm and analyze its complexity.

4.1 Algorithm

Optimizing the model in Eq. 3 is equivalent to minimizing the following objective $J(W, \beta)$ subject to a constraint on $W$:

$$J(W, \beta) = \sum_{i=1}^{n} W_i \cdot \log\big(1 + \exp((1 - 2Y_i) \cdot (x_i \beta))\big) \qquad (4)$$
$$\qquad + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2$$
$$\qquad + \lambda_2 \|W\|_2^2 + \lambda_3 \|\beta\|_2^2 + \lambda_4 \|\beta\|_1 + \lambda_5 \Big(\sum_{k=1}^{n} W_k - 1\Big)^2$$
$$\text{s.t.} \quad W \succeq 0.$$

It is difficult to obtain an analytical solution to the final optimization problem in Eq. 4, so we solve it with an iterative optimization algorithm. First, we initialize the sample weights $W$ and the causal contribution $\beta$. Once the initial values are given, in each iteration we first update $\beta$ with $W$ fixed, and then update $W$ with $\beta$ fixed. These steps are described below.

Update $\beta$: When $W$ is fixed, problem (4) is equivalent to optimizing the following objective function:

$$J(\beta) = \sum_{i=1}^{n} W_i \cdot \log\big(1 + \exp((1 - 2Y_i) \cdot (x_i \beta))\big) + \lambda_3 \|\beta\|_2^2 + \lambda_4 \|\beta\|_1, \qquad (5)$$

which is a standard $\ell_1$-norm (elastic net) regularized logistic regression problem and can be solved with any solver for elastic-net-regularized logistic regression. Here, we use the proximal gradient algorithm [Parikh et al. 2014] with a proximal operator to optimize the objective function in (5).
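The β update can be sketched as a plain proximal gradient loop on objective (5): a gradient step on the smooth part (weighted logistic loss plus the squared ℓ2 penalty), followed by soft-thresholding, which is the proximal operator of the ℓ1 penalty. This is a minimal sketch under assumptions, not the authors' implementation: the fixed step size 1/L from a Lipschitz bound, the zero initialization, the iteration count and the names `update_beta`, `objective` and `soft_threshold` are all illustrative choices.

```python
import numpy as np

def objective(beta, X, Y, W, lam3, lam4):
    """J(beta) of Eq. (5): weighted logistic loss plus elastic-net penalties."""
    z = (1.0 - 2.0 * Y) * (X @ beta)
    return W @ np.logaddexp(0.0, z) + lam3 * (beta @ beta) + lam4 * np.abs(beta).sum()

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def update_beta(X, Y, W, lam3, lam4, n_iter=500):
    """Proximal gradient descent on Eq. (5) with sample weights W held fixed."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=float)
    beta = np.zeros(X.shape[1])
    s = 1.0 - 2.0 * Y                              # +1 for Y = 0, -1 for Y = 1
    # Upper bound on the Lipschitz constant of the smooth part's gradient.
    L = 0.25 * np.linalg.norm(X * np.sqrt(W)[:, None], 2) ** 2 + 2.0 * lam3
    step = 1.0 / L
    for _ in range(n_iter):
        z = s * (X @ beta)
        sigma = 0.5 * (1.0 + np.tanh(0.5 * z))     # numerically stable sigmoid(z)
        grad = X.T @ (W * sigma * s) + 2.0 * lam3 * beta
        beta = soft_threshold(beta - step * grad, step * lam4)
    return beta

# Toy usage (hypothetical data): feature 0 equals the label, feature 1 is context.
X = np.array([[1, 1], [1, 1], [1, 1], [1, 0],
              [0, 1], [0, 0], [0, 0], [0, 0]], dtype=float)
Y = X[:, 0].copy()
W = np.full(8, 1.0 / 8.0)
beta = update_beta(X, Y, W, lam3=0.01, lam4=0.01)
print(beta, objective(beta, X, Y, W, 0.01, 0.01))
```

A fixed step of 1/L guarantees that the objective does not increase from one iteration to the next; a backtracking or accelerated variant would be a natural refinement. With a sufficiently large ℓ1 weight, the soft-thresholding step keeps every coefficient at exactly zero.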

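Update $W$: With $\beta$ fixed, we can obtain $W$ by optimizing (4). This is equivalent to optimizing the following objective function:

$$J(W) = \sum_{i=1}^{n} W_i \cdot \log\big(1 + \exp((1 - 2Y_i) \cdot (x_i \beta))\big) \qquad (6)$$
$$\qquad + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (W \odot I_j)}{W^T I_j} - \frac{X_{-j}^T (W \odot (1 - I_j))}{W^T (1 - I_j)} \right\|_2^2 + \lambda_2 \|W\|_2^2 + \lambda_5 \Big(\sum_{k=1}^{n} W_k - 1\Big)^2$$
$$\text{s.t.} \quad W \succeq 0.$$

To ensure the non-negativity of $W$, we let $W = \omega \odot \omega$, where $\omega \in \mathbb{R}^{n \times 1}$ and $\odot$ refers to the Hadamard product. Then problem (6) can be reformulated as:

$$J(\omega) = \sum_{i=1}^{n} (\omega_i \odot \omega_i) \cdot \log\big(1 + \exp((1 - 2Y_i) \cdot (x_i \beta))\big) \qquad (7)$$
$$\qquad + \lambda_1 \sum_{j=1}^{p} \left\| \frac{X_{-j}^T (\omega \odot \omega \odot I_j)}{(\omega \odot \omega)^T I_j} - \frac{X_{-j}^T (\omega \odot \omega \odot (1 - I_j))}{(\omega \odot \omega)^T (1 - I_j)} \right\|_2^2 + \lambda_2 \|\omega \odot \omega\|_2^2 + \lambda_5 \Big(\sum_{k=1}^{n} \omega_k \odot \omega_k - 1\Big)^2.$$

The partial gradient of the term $J(\omega)$ with respect to $\omega$ is:

$$\frac{\partial J(\omega)}{\partial \omega} = \sum_{j=1}^{p} 4 \left( \frac{\partial J_b}{\partial \omega} \odot (\omega \cdot \mathbf{1}^T)^T \right)^T \cdot J_b,$$

where $\mathbf{1}^T = (\underbrace{1, 1, \cdots, 1}_{p})$, and

$$J_b = \frac{X_{-j}^T (\omega \odot \omega \odot I_j)}{(\omega \odot \omega)^T I_j} - \frac{X_{-j}^T (\omega \odot \omega \odot (1 - I_j))}{(\omega \odot \omega)^T (1 - I_j)},$$

$$\frac{\partial J_b}{\partial \omega} = \frac{X_{-j}^T \odot (I_j \cdot \mathbf{1}^T)^T \cdot \big((\omega \odot \omega)^T I_j\big) - X_{-j}^T (\omega \odot \omega \odot I_j) \cdot I_j^T}{\big((\omega \odot \omega)^T I_j\big)^2} - \frac{X_{-j}^T \odot ((1 - I_j) \cdot \mathbf{1}^T)^T \cdot \big((\omega \odot \omega)^T (1 - I_j)\big) - X_{-j}^T (\omega \odot \omega \odot (1 - I_j)) \cdot (1 - I_j)^T}{\big((\omega \odot \omega)^T (1 - I_j)\big)^2}.$$

Then we determine the step size $a$ with line search, and update $\omega$ at the $t$-th iteration as:

$$\omega^{(t)} = \omega^{(t-1)} - a \cdot \frac{\partial J(\omega^{(t-1)})}{\partial \omega^{(t-1)}},$$

and update $W^{(t)}$ at the $t$-th iteration with:

$$W^{(t)} = \omega^{(t)} \odot \omega^{(t)}.$$

We update $\beta$ and $W$ iteratively until the objective function (4) converges. The whole algorithm is summarized in Algorithm 1. Finally, with the causal contribution $\beta$ optimized by our CRLR algorithm, we can perform causal classification with interpretable insights on any testing image set.

4.2 Complexity Analysis

During optimization, the main costs are calculating the loss $J(W, \beta)$, updating the causal feature weights $\beta$, and updating the sample weights $W$. We analyze the time complexity of each of them in turn. For the calculation of the loss, the complexity is $O(np^2)$, where $n$ is the sample size and $p$ is the dimension of the variables. For updating $\beta$, which is a standard $\ell_1$-regularized problem, the complexity is $O(np)$. For updating $W$, the complexity is dominated by the calculation of the partial gradient of $J(\omega)$ with respect to $\omega$. The complexity of $\frac{\partial J(\omega)}{\partial \omega}$ is $O(np^2)$.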
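The $O(np^2)$ cost of the $W$ update can be made concrete with a small sketch that evaluates only the confounder balancing part of $J(\omega)$ (the sum multiplied by $\lambda_1$) and its gradient under the reparameterization $W = \omega \odot \omega$, followed by one gradient step with a simple backtracking line search. The logistic, $\lambda_2$ and $\lambda_5$ terms are deliberately omitted, the gradient is derived here by the chain rule rather than copied from the paper, and the names `balancing_loss_and_grad` and `update_omega` as well as the backtracking rule are assumptions for illustration.

```python
import numpy as np

def balancing_loss_and_grad(omega, X, I):
    """Balancing term sum_j ||J_b||^2 with W = omega * omega, and its gradient in omega.

    Assumes every feature has treated and control samples with positive weight.
    The loop over p features, each costing O(n p), gives O(n p^2) overall.
    """
    X = np.asarray(X, dtype=float)
    I = np.asarray(I, dtype=float)
    W = omega * omega
    n, p = X.shape
    loss = 0.0
    grad_W = np.zeros(n)
    for j in range(p):
        X_minus_j = X.copy()
        X_minus_j[:, j] = 0.0                  # confounders: X with column j set to 0
        t = I[:, j]                            # treated indicator for feature j
        c = 1.0 - t                            # control indicator
        s_t, s_c = W @ t, W @ c                # total weight of treated / control
        mean_t = X_minus_j.T @ (W * t) / s_t
        mean_c = X_minus_j.T @ (W * c) / s_c
        J_b = mean_t - mean_c                  # shape (p,)
        loss += J_b @ J_b
        # Row i holds d J_b / d W_i (quotient rule on each weighted mean).
        dJb_dW = (t[:, None] * (X_minus_j - mean_t)) / s_t \
               - (c[:, None] * (X_minus_j - mean_c)) / s_c
        grad_W += 2.0 * (dJb_dW @ J_b)
    return loss, 2.0 * omega * grad_W          # chain rule: dW_i / d omega_i = 2 omega_i

def update_omega(omega, X, I, step=1.0, shrink=0.5, max_halvings=30):
    """One gradient step on omega; the step is halved until the loss decreases."""
    loss, grad = balancing_loss_and_grad(omega, X, I)
    for _ in range(max_halvings):
        candidate = omega - step * grad
        new_loss, _ = balancing_loss_and_grad(candidate, X, I)
        if new_loss < loss:
            return candidate, new_loss
        step *= shrink
    return omega, loss

# Toy binary features (hypothetical): the 4 low bits of the sample index 0..11.
X = np.array([[(i >> j) & 1 for j in range(4)] for i in range(12)], dtype=float)
I = X.copy()                                   # binary features act as their own indicators
omega = np.ones(12)
loss, grad = balancing_loss_and_grad(omega, X, I)
new_omega, new_loss = update_omega(omega, X, I)
print(loss, new_loss)
```

The factor 4 in the paper's gradient expression corresponds to the two factors of 2 in this sketch: one from differentiating the squared norm and one from differentiating $\omega \odot \omega$.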

Algorithm 1 Causally Regularized Logistic Regression (CRLR)
Input: Tradeoff parameters $\lambda_1 > 0$, $\lambda_2 > 0$, $\lambda_3 > 0$, $\lambda_4 > 0$, $\lambda_5 > 0$, variables matrix $X$ and outcome $Y$.
Output: Causal contribution $\beta$ and sample weight $W$.
1: Calculate the indicator matrix $I$ from the variables matrix $X$.
2: Initialize the causal contribution $\beta^{(0)}$ and the sample weight $W^{(0)}$.
3: Calculate the current value of $J(W, \beta)^{(0)} = J(W^{(0)}, \beta^{(0)})$ with Equation (4).
4: Initialize the iteration variable $t \leftarrow 0$.
5: repeat
6:   $t \leftarrow t + 1$
7:   Update $\beta^{(t)}$ by solving $J(\beta^{(t-1)})$ in Equation (5).
8:   Update $W^{(t)}$ by solving $J(W^{(t-1)})$ in Equation (6).
9:   Calculate $J(W, \beta)^{(t)} = J(W^{(t)}, \beta^{(t)})$.
10: until $J(W, \beta)^{(t)}$ converges or the maximum number of iterations is reached
11: return $\beta$, $W$.

In total, the complexity of each iteration in Algorithm 1 is $O(np^2)$.

5 EXPERIMENTS

In this section, we introduce the experimental settings, present the results of a comparative study in three non-i.i.d. situations, and visualize the explainable causal features.

5.1 Experimental Settings

Dataset. In order to simulate the potential non-i.i.d. situations of the real world, we manually construct a 10-category dataset based on images from YFCC100M [Thomee et al. 2016]. The YFCC100M dataset provides 100 million images, and each image carries multiple tags. In constructing our dataset, we first select a major object tag (e.g. dog) as the category label, and select 5 other context tags which frequently co-occur with the major tag (e.g. grass, beach, car). We then combine the major tag and a context tag as the query to retrieve images from YFCC100M. After scrutinizing the image contents to guarantee their correctness w.r.t. the category and context, we obtain a number of images of a certain context in a category. In total, we organize 10 categories, each containing 5 contexts. The details of the dataset are described in Table 2.

Table 2. Statistics of our dataset with 10 categories, where each category has 5 contexts. The number of images per context is given in parentheses.

Category  Context 1            Context 2         Context 3      Context 4       Context 5     Total
bird      duck (210)           gull (200)        hawk (200)     heron (200)     parrot (190)  1000
bridge    san francisco (160)  london (110)      nyc (110)      street (100)    sydney (180)  660
car       art (114)            bmw (120)         classic (200)  ferrari (200)   racing (180)  814
cat       black (180)          house (120)       kitten (200)   tabby (200)     white (240)   940
church    basilica (94)        catholic (83)     gothic (104)   orthodox (100)  roman (81)    462
dog       beach (200)          car (150)         grass (200)    home (200)      snow (190)    940
flower    blossom (200)        lily (240)        orchid (240)   rose (220)      tulip (190)   1090
horse     dressage (260)       equestrian (206)  jumping (200)  pony (50)       racing (140)  856
train     diesel (250)         locomotive (230)  metro (100)    station (68)    steam (150)   798
tree      christmas (140)      leaves (220)      palm (170)     snow (160)      spring (170)  860

Image Representation. For ease of visualization and interpretation, we use visual words [Csurka et al. 2004] as features to represent images. In detail, we first detect interest points on 8*8 grids over the images. For the sake of brevity and generality, the feature points are extracted on a predefined grid (8*8, 16*16, etc.). We then use the SURF descriptor (speeded up robust features) [Bay et al. 2006] to describe each feature point as a numerical feature vector. Finally, we apply the k-means clustering algorithm [Hartigan and Wong 1979] to the feature descriptors extracted from the whole image set and obtain 500 visual clusters. With the k-means algorithm, each feature descriptor is assigned to one of the k mutually exclusive clusters. Each cluster center is defined as a "visual word", and each feature descriptor is assigned to a visual word by nearest neighbor. Each image is then encoded into a 500-dimensional feature vector in which each dimension is a binary variable indicating whether a visual word occurs in the image.

Baselines. We implement the following competitive baseline classifiers to compare with our algorithm.
• LR [Menard 2002]: Logistic Regression (LR) is a typical correlation-based method and has been widely used in many classification problems.
• LR+L1 [Tibshirani 1996]: To avoid overfitting of the LR model, we impose an L1 regularizer on Logistic Regression.
• SVM [Cortes and Vapnik 1995]: Support Vector Machine (SVM) is another typical classification method, and we use the SVM with the kernel method as a baseline.
• Two-Step: A straightforward two-step solution which first performs causal feature selection via confounder balancing [Athey et al. 2016] and then applies Logistic Regression. We tuned the number of top causal features, and report the results with the optimal one.
• MLP [LeCun et al. 1998]: We implement a Multi-layer Perceptron (MLP) as a baseline classifier.
After tuning the neural network structure, we adopt the structure that performs best on the validation set, with 3 hidden layers ( ), in the experiments.

We tuned the parameters of our algorithm and the baselines via cross validation, by grid search on the validation set.

5.2 Experimental Results
In this section, we report our experimental results under three different non-i.i.d. situations: radical context bias, moderate context bias, and label composition bias. In the moderate context bias situation, we explore a more general condition in which the distribution of contexts differs between the training and testing sets. In the label composition bias situation, we investigate a common problem in which the percentage of positive samples varies from the training to the testing set.

Radical Context Bias.
Settings. In this experiment, we simulate the non-i.i.d. situation by splitting different contexts into training, validation and testing sets. For each category, we use contexts 1, 2 and 3 for training, context 4 for validation and context 5 for testing. Moreover, we perform non-uniform sampling among the contexts in the training set, so that contexts 1, 2 and 3 account for proportions of 0.66, 0.17 and 0.17 respectively. This setting is consistent with the natural phenomenon that visual concepts follow a power-law distribution [Clauset et al. 2009], meaning that only a few visual concepts are common and the large majority are rare. We transfer this notion to visual contexts.

Results. We report the performance in Accuracy and F1 in Table 3. From the results, we have the following observations. (1) Our CRLR model achieves the best performance in almost all categories (9/10). Since the major difference between CRLR and a standard Logistic Regression model is the causal regularizer, we can attribute the significant improvement to the effective confounder balancing term and its seamless integration with the logistic regression model. (2) The performance of the two-step approach is much worse than that of CRLR, which clearly demonstrates the importance of jointly optimizing causal feature selection and classification. (3) Not surprisingly, the correlation-based classification methods do not work well in this setting, mainly because they erroneously give high importance to correlational but non-causal features, which makes them sensitive to context change.

Table 3. Results of classifiers under non-i.i.d. situation with radical context bias in data. (Accuracy and F1 of LR, LR+L1, SVM, Two-Step, MLP and CRLR for each of the categories bird, bridge, car, cat, church, dog, flower, horse, train and tree.)

Insightful Analysis. An interesting question is whether CRLR performs much better in categories where the bias is more serious. Here we quantify the bias level of a category by the EMD distance between the average feature vector of the training images and the average feature vector of the testing images. We also quantify the superiority of CRLR by its relative F1 improvement over the best baseline. We show the results in Figure 2. We can see that the relative F1 improvement and the category bias level are correlated to some degree. The extreme cases are more obvious. For example, the dog category is the most biased, and there the relative improvement of CRLR in F1 reaches about 50%. In contrast, the bias in the church category is not obvious, which can account for the ordinary performance of CRLR in the church category in Table 3.

Fig. 2. The relationship between our CRLR algorithm performance and context bias on each category. The more context bias in data, the more relative F1 improvement of our CRLR algorithm. (Plot of context bias and relative F1 improvement for the categories bird, bridge, car, cat, church, dog, flower, horse, train and tree.)

Moderate Context Bias.
Settings. In this experiment, we explore a more general condition where the training and testing sets consist of the same contexts but with different percentages. The training set is constructed in the same way as in Section 5.2.1, where the proportions of contexts 1, 2 and 3 are 0.66, 0.17 and 0.17 respectively. In order to simulate different levels of context bias between the training and testing sets, we also construct the testing data from contexts 1, 2 and 3, but vary the proportion of context 1 in the whole testing set from 0.2 to 0.8, with the remainder divided equally between contexts 2 and 3. For each setting, we run our algorithm and the baselines on all 10 categories and report the average F1.

Fig. 3. Average F1 performance on different context bias. (Plot of average F1 against the percentage of context 1 in the testing set for LR, LR+L1, SVM, Two-Step, MLP and CRLR.)

Results. The results are shown in Figure 3. Our proposed CRLR algorithm outperforms the baselines at all levels of context bias. The most competitive baseline is SVM. Comparing CRLR with SVM, we can see that the advantage of CRLR is more pronounced when the proportion of context 1 is smaller. Considering that context 1 dominates the training set with a proportion of 0.66, this result is easy to understand: a smaller proportion of context 1 in the testing set implies a larger context bias between the training and testing sets. Besides, the point between 0.6 and 0.7 in the percentage of context 1 should also be noted.
At that point, the proportions of contexts 1, 2 and 3 in the testing set are around 0.66, 0.17 and 0.17 respectively, which are almost the same as in the training set. This implies an i.i.d. situation, in which the performance of CRLR is almost the same as that of SVM. These results demonstrate that CRLR can significantly outperform correlation-based methods in non-i.i.d. situations, while performing as well as correlation-based methods in i.i.d. situations.

Label Composition Bias.
Settings. In this experiment, we consider a common situation in which the percentages of positive and negative samples differ between the training and testing sets. We use contexts 1, 2 and 3 for training, context 4 for validation and context 5 for testing. For each context, we fix the positive sample rate at 25% in the training set and vary the positive sample rate in the testing set from 0.1 to 0.9. For each percentage of positive samples in the testing set, we report the average F1 over the 10 categories.

Fig. 4. Average F1 performance on different label composition bias. (Plot of average F1 against the percentage of positive samples for LR, LR+L1, SVM, Two-Step, MLP and CRLR.)

Results. From Figure 4, we can see that our CRLR algorithm performs the best in all settings, and that its advantage over the baselines grows as the percentage of positive samples in the testing set increases. When the percentage reaches 0.9, CRLR improves the average F1 over all categories from 0.58 for the baselines to 0.73. When the training set is dominated by negative samples, traditional classifiers assign higher weights to negative features and are cautious about giving positive predictions. If the testing set is then dominated by positive samples, these classifiers cannot work well. In contrast, our CRLR model always places emphasis on causal features and is robust to scenarios where positive samples are rather sparse. This merit is fully demonstrated by Figure 4.

Feature Visualization and Explanation. Another important goal of introducing causality into image classification is to make image classification models more explainable. Previous classification models, especially deep learning models, are typical black-box models that are hardly explainable. Explainable models are highly desirable in many applications, especially those in which people are involved in making decisions. To demonstrate the interpretability of our method, we visualize the top-5 features in each category selected by CRLR and LR respectively. Due to space limitations, we only show some examples from 4 categories in Figure 5. We can see that most of the features selected by CRLR are positioned on the major object. For example, in the dog category, the features selected by CRLR are indeed from the dog's nose, ear, fur, etc., which are causal features that determine whether an image belongs to the dog category. In contrast, many of the features selected by LR are context features. From the perspective of explainability, CRLR can provide sufficient explanations of why it classifies an image into the dog category, because it detects causal features such as the dog's nose and fur. As for LR, the correlational features are often difficult to interpret. We still find, however, that our method exploits correlational features in some cases, as depicted in Figures 5(m) and 5(o). This might be because the bias level in the train category is fairly low, which weakens the effect of the causal regularizer.
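The selection of the top-5 features per category can be sketched as below. This is a minimal illustration rather than the authors' implementation: it assumes that the "top" features are those with the largest coefficients in a fitted linear classifier, a ranking criterion the text does not state, and the names `top_k_features` and `weights` are hypothetical.

```python
import numpy as np


def top_k_features(weights, k=5):
    """Return the indices of the k largest coefficients, largest first.

    Assumption: a feature's importance is its signed coefficient in a
    linear classifier; the paper does not specify the ranking criterion.
    """
    weights = np.asarray(weights, dtype=float)
    if weights.ndim != 1:
        raise ValueError("weights must be a 1-D coefficient vector")
    k = min(k, weights.size)
    # A stable sort on the negated weights yields descending order,
    # with ties broken in favor of the lower feature index.
    order = np.argsort(-weights, kind="stable")
    return order[:k].tolist()


# Toy coefficient vector over 8 hypothetical visual words.
weights = [0.1, 2.0, -0.5, 1.5, 0.0, 3.0, 0.7, 1.5]
print(top_k_features(weights, k=5))  # -> [5, 1, 3, 7, 6]
```

Each returned index identifies one visual word. Because a visual word can occur at several locations in an image, mapping an index back to its descriptors may yield more than one bounding box per selected feature.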

Fig. 5. Top 5 features selected by CRLR and Logistic Regression (panels (a) to (p)). The red boxes indicate the features that CRLR selects and the green boxes indicate the features that Logistic Regression selects. Note that each feature represents a visual word and may correspond to multiple bounding boxes, so the numbers of red and green boxes may not be equal.

Fig. 6. Sensitivity analysis with respect to λ1. (Plot of average F1 against λ1 for the categories bird, bridge, dog and train.)

Parameter Sensitivity. In this section, we investigate the parameter sensitivity of our CRLR algorithm. As λ2 to λ5 are the weights of commonly used regularizers, we evaluate only the effect of parameter λ1 on the results. λ1 is essentially a trade-off parameter that controls the relative weights of logistic regression and the causal regularizer. More intrinsically, it controls the trade-off between predictive power and the degree of bias balancing. Because the results of different categories show similar trends, we use just four categories (bird, bridge, dog, train) as examples for brevity. The value of λ1 is varied over {0.1, 0.3, 1, 3, 5, 10, 15}. We plot the results in Figure 6. We can see that the average F1 changes smoothly as λ1 varies and that there is quite a large stable region from which the optimal λ1 can be selected, demonstrating that our method is not sensitive to this parameter.

6 CONCLUSION AND DISCUSSION
In this paper, we focus on the image classification task under non-i.i.d. situations. We argue that most previous correlation-based methods can only preserve their predictive power when the training and testing sets are drawn from the same distribution, and cannot generalize well under non-i.i.d. situations. Moreover, the results produced by those correlation-based methods can hardly be interpreted. To address the non-i.i.d. challenge, we introduce causality into image classification and propose a novel Causally Regularized Logistic Regression (CRLR) model that jointly optimizes the logistic loss and a causal regularizer for causal classification on images. We construct a new dataset to simulate various non-i.i.d. situations in real-world applications and conduct extensive experiments. The experimental results demonstrate that our CRLR algorithm outperforms the traditional correlation-based methods in various settings. We also demonstrate that the top causal features selected by CRLR can provide explainable insights.

In this paper, we omit the comparison with CNN-based image classifiers, as we cannot retrain a CNN model with only the thousands of images in our dataset. Using a pre-trained CNN such as AlexNet instead would be seriously unfair, because the representations in AlexNet are trained on millions of images. It is known that these complicated deep models tend to overfit the data. In this sense, they are also sensitive to the context bias and label composition bias.
We leave the empirical study of this problem to future work.

REFERENCES
Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. HITON: a novel Markov Blanket algorithm for optimal variable selection. In AMIA Annual Symposium Proceedings. American Medical Informatics Association, 21.
Susan Athey, Guido W Imbens, and Stefan Wager. Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions. arXiv preprint (2016).
Peter C Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research 46, 3 (2011).
Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics 61, 4 (2005).
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. Computer vision ECCV 2006 (2006).
Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review 51, 4 (2009).
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning 20, 3 (1995).
Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, Vol. 1. Prague, 1-2.
Virgile Landeiro Dos Reis and Aron Culotta. Using matched samples to estimate the effects of exercise on mental health from Twitter. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart, and Marie Davidian. Doubly robust estimation of causal effects. American journal of epidemiology 173, 7 (2011).
Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis (2011), mpr025.
John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979).
Jiayuan Huang, Alexander J Smola, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, et al. Correcting sample selection bias by unlabeled data. Advances in neural information processing systems 19 (2007), 601.
Daphne Koller and Mehran Sahami. Toward optimal feature selection. Technical Report. Stanford InfoLab.
Kun Kuang, Peng Cui, Bo Li, Meng Jiang, Shiqiang Yang, and Fei Wang. Treatment effect estimation with data-driven variable decomposition. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI.
Michael Lechner. Earnings and employment effects of continuous off-the-job training in East Germany after unification. Journal of Business & Economic Statistics 17, 1 (1999).
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998).
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems.
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering Causal Signals in Images. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
Scott Menard. Applied logistic regression analysis. Number 106. Sage.
Neal Parikh, Stephen P Boyd, et al. Proximal Algorithms. Foundations and Trends in optimization 1, 3 (2014).
Jean-Philippe Pellet and André Elisseeff. Using Markov blankets for causal structure learning. Journal of Machine Learning Research 9, Jul (2008).
Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika (1983).
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90, 2 (2000).
Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 25, 1 (2010), 1.
Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8, May (2007).
Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal Inference via Sparse Additive Models with Application to Online Advertising. In AAAI.
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016).
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996).
Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of machine learning research 5, Oct (2004).
Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 114.
José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. J. Amer. Statist. Assoc. 110, 511 (2015).


More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

RaRE: Social Rank Regulated Large-scale Network Embedding

RaRE: Social Rank Regulated Large-scale Network Embedding RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Loss Functions and Optimization. Lecture 3-1

Loss Functions and Optimization. Lecture 3-1 Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative Assignment 1 is released: http://cs231n.github.io/assignments2017/assignment1/ Due Thursday April 20, 11:59pm on Canvas (Extending

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models

Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Distinguishing Causes from Effects using Nonlinear Acyclic Causal Models Kun Zhang Dept of Computer Science and HIIT University of Helsinki 14 Helsinki, Finland kun.zhang@cs.helsinki.fi Aapo Hyvärinen

More information

DEIM Forum 04 E-3 43 80 3 5 / DC 43 80 3 5 90065 6-6 43 80 3 5 E-mail: gs3007@s.inf.shizuoka.ac.jp, dgs538@s.inf.shizuoka.ac.jp, ishikawa-hiroshi@sd.tmu.ac.jp, yokoyama@inf.shizuoka.ac.jp Flickr Exif OpenLayers

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths for New Developments in Nonparametric and Semiparametric Statistics, Joint Statistical Meetings; Vancouver, BC,

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 12 Neural Networks 24.11.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI)

More information

Encoder Based Lifelong Learning - Supplementary materials

Encoder Based Lifelong Learning - Supplementary materials Encoder Based Lifelong Learning - Supplementary materials Amal Rannen Rahaf Aljundi Mathew B. Blaschko Tinne Tuytelaars KU Leuven KU Leuven, ESAT-PSI, IMEC, Belgium firstname.lastname@esat.kuleuven.be

More information

38 1 Vol. 38, No ACTA AUTOMATICA SINICA January, Bag-of-phrases.. Image Representation Using Bag-of-phrases

38 1 Vol. 38, No ACTA AUTOMATICA SINICA January, Bag-of-phrases.. Image Representation Using Bag-of-phrases 38 1 Vol. 38, No. 1 2012 1 ACTA AUTOMATICA SINICA January, 2012 Bag-of-phrases 1, 2 1 1 1, Bag-of-words,,, Bag-of-words, Bag-of-phrases, Bag-of-words DOI,, Bag-of-words, Bag-of-phrases, SIFT 10.3724/SP.J.1004.2012.00046

More information

Nonparametric Inference for Auto-Encoding Variational Bayes

Nonparametric Inference for Auto-Encoding Variational Bayes Nonparametric Inference for Auto-Encoding Variational Bayes Erik Bodin * Iman Malik * Carl Henrik Ek * Neill D. F. Campbell * University of Bristol University of Bath Variational approximations are an

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Support Vector Machine I

Support Vector Machine I Support Vector Machine I Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative Please use piazza. No emails. HW 0 grades are back. Re-grade request for one week. HW 1 due soon. HW

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Summary and discussion of The central role of the propensity score in observational studies for causal effects

Summary and discussion of The central role of the propensity score in observational studies for causal effects Summary and discussion of The central role of the propensity score in observational studies for causal effects Statistics Journal Club, 36-825 Jessica Chemali and Michael Vespe 1 Summary 1.1 Background

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

Wavelet-based Salient Points with Scale Information for Classification

Wavelet-based Salient Points with Scale Information for Classification Wavelet-based Salient Points with Scale Information for Classification Alexandra Teynor and Hans Burkhardt Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Germany {teynor, Hans.Burkhardt}@informatik.uni-freiburg.de

More information

Large-Margin Thresholded Ensembles for Ordinal Regression

Large-Margin Thresholded Ensembles for Ordinal Regression Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

word2vec Parameter Learning Explained

word2vec Parameter Learning Explained word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector

More information

Domain adaptation for deep learning

Domain adaptation for deep learning What you saw is not what you get Domain adaptation for deep learning Kate Saenko Successes of Deep Learning in AI A Learning Advance in Artificial Intelligence Rivals Human Abilities Deep Learning for

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Clustering with k-means and Gaussian mixture distributions

Clustering with k-means and Gaussian mixture distributions Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2012-2013 Jakob Verbeek, ovember 23, 2012 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.12.13

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Characterization of Jet Charge at the LHC

Characterization of Jet Charge at the LHC Characterization of Jet Charge at the LHC Thomas Dylan Rueter, Krishna Soni Abstract The Large Hadron Collider (LHC) produces a staggering amount of data - about 30 petabytes annually. One of the largest

More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Course in Data Science

Course in Data Science Course in Data Science About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst. The course gives an

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion Tutorial on Methods for Interpreting and Understanding Deep Neural Networks W. Samek, G. Montavon, K.-R. Müller Part 3: Applications & Discussion ICASSP 2017 Tutorial W. Samek, G. Montavon & K.-R. Müller

More information

2017 Fall ECE 692/599: Binary Representation Learning for Large Scale Visual Data

2017 Fall ECE 692/599: Binary Representation Learning for Large Scale Visual Data 2017 Fall ECE 692/599: Binary Representation Learning for Large Scale Visual Data Liu Liu Instructor: Dr. Hairong Qi University of Tennessee, Knoxville lliu25@vols.utk.edu September 21, 2017 Liu Liu (UTK)

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Multi-Layer Boosting for Pattern Recognition

Multi-Layer Boosting for Pattern Recognition Multi-Layer Boosting for Pattern Recognition François Fleuret IDIAP Research Institute, Centre du Parc, P.O. Box 592 1920 Martigny, Switzerland fleuret@idiap.ch Abstract We extend the standard boosting

More information