The degree distribution


Universitat Politècnica de Catalunya
Complex and Social Networks (2018-2019)
Master in Innovation and Research in Informatics (MIRI)
Version 0.5

Instructors: Argimiro Arratia, argimiro@cs.upc.edu, http://www.cs.upc.edu/~argimiro/; Marta Arias, marias@cs.upc.edu, http://www.cs.upc.edu/~marias/. Please go to http://www.cs.upc.edu/~csn for all course material, schedule, lab work, etc.

Outline

The limits of visual analysis: a syntactic dependency network [Ferrer-i Cancho et al., 2004].

The empirical degree distribution
N: the (finite) number of vertices.
k: vertex degree.
n(k): the number of vertices of degree k.
n(0), n(1), n(2), ..., n(N) defines the degree spectrum (loops are allowed).
n(k)/N: the proportion of vertices of degree k, which defines the (empirical) degree distribution.
p(k): the function giving the probability that a vertex has degree k, with p(k) ≈ n(k)/N. p(k) is a probability mass function (pmf).
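As a minimal sketch (not from the lecture; the degree sequence below is an illustrative toy example), these quantities are straightforward to compute in Python:

    from collections import Counter

    def degree_spectrum(degrees):
        # n(k): number of vertices of degree k, from a degree sequence
        return Counter(degrees)

    def empirical_degree_distribution(degrees):
        # p(k) ~ n(k)/N: proportion of vertices of degree k
        N = len(degrees)
        return {k: nk / N for k, nk in degree_spectrum(degrees).items()}

    degrees = [1, 1, 2, 3, 3, 3, 5]  # toy degree sequence
    print(empirical_degree_distribution(degrees))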

Example: degree spectrum. Global syntactic dependency network (English). Nodes: words. Links: syntactic dependencies. Not so simple: many degrees occur just once! Initial bending or hump: power law?

Example: empirical degree distribution. Notice the scale of the y-axis. Normalized version of the degree spectrum (dividing by N).

Example: in-degree (red) versus out-degree (green). The distribution of in-degree and that of out-degree need not be identical! Similar for global syntactic dependency networks? Differences in the distribution or in the parameters? There are known cases of radical differences between in- and out-degree distributions (e.g., web pages, Wikipedia articles). In-degree is more power-law-like than out-degree.

What is the mathematical form of p(k)? Possible degree distributions. The typical hypothesis: a power law,
p(k) = c k^{-\gamma},
but which one exactly? How many free parameters? Zeta distribution: 1 free parameter. Right-truncated zeta distribution: 2 free parameters. ... Motivation: accurate data description (looks are deceiving); helps to design or select dynamical models.

Zeta distributions I
Zeta distribution:
p(k) = \frac{k^{-\gamma}}{\zeta(\gamma)}, \quad \zeta(\gamma) = \sum_{x=1}^{\infty} x^{-\gamma},
ζ being the Riemann zeta function (here it is assumed that γ is real). ζ(γ) converges only for γ > 1 (so γ > 1 is needed). γ is the only free parameter! Do we wish p(k) > 0 for k > N?

Zeta distributions II
Right-truncated zeta distribution:
p(k) = \frac{k^{-\gamma}}{H(k_{max}, \gamma)}, \quad H(k_{max}, \gamma) = \sum_{x=1}^{k_{max}} x^{-\gamma},
H(k_{max}, γ) being the generalized harmonic number of order k_{max} of γ. Or why not p(k) = c k^{-\gamma} e^{-\beta k} (modified power law, Altmann distribution, ...) with 2 or 3 free parameters? Which one is best? (standard model selection)

What is the mathematical form of p(k)? Possible degree distributions. The null hypothesis (for an Erdős-Rényi graph without loops):
p(k) = \binom{N-1}{k} \pi^k (1-\pi)^{N-1-k},
with π as the only free parameter (assuming that N is given by the real network). This is a binomial distribution with parameters N − 1 and π, thus ⟨k⟩ = (N − 1)π ≈ Nπ. Another null hypothesis: random pairing of vertices with a constant number of edges E.
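As a quick illustration of this null hypothesis (a sketch only; the values of N and π below are arbitrary examples), the binomial pmf is available in scipy:

    import numpy as np
    from scipy.stats import binom

    N = 1000    # number of vertices (taken from the real network)
    pi = 0.01   # the only free parameter
    k = np.arange(N)
    p_k = binom.pmf(k, N - 1, pi)  # p(k) = C(N-1, k) pi^k (1-pi)^(N-1-k)
    print(p_k.sum())        # ~1.0: a proper pmf
    print((k * p_k).sum())  # mean degree (N-1)*pi ~ 9.99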

The problems II
Is f(k) a good candidate? Does f(k) fit the empirical degree distribution well enough? f(k) is a (candidate) model. How do we evaluate the goodness of a model? Three major approaches:
1. Qualitatively (visually).
2. The error of the model: the deviation between the model and the data.
3. The likelihood of the model: the probability that the model produces the data.

Assume two variables: a predictor x (e.g., k, vertex degree) and a response y (e.g., n(k), the number of vertices of degree k; or p(k), ...). Look for a transformation of at least one of the variables that shows approximately a straight line (upon visual inspection) and derive the dependency between the two original variables. Typical transformations: x' = log(x), y' = log(y).
1. If y' = \log(y) = ax + b (linear-log scale) then y = e^{ax+b} = c e^{ax}, with c = e^b (exponential).
2. If y' = \log(y) = ax' + b = a \log(x) + b (log-log scale) then y = e^{a \log(x) + b} = c x^a, with c = e^b (power law).
3. If y = ax' + b = a \log(x) + b (log-linear scale) then the transformation is exactly the functional dependency between the original variables (logarithmic).
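A minimal sketch of case 2 (log-log scale), on synthetic power-law data rather than a real network; the straight line is fitted with numpy.polyfit:

    import numpy as np

    # synthetic power law y = c x^a with a = -2, c = 100
    x = np.arange(1, 101, dtype=float)
    y = 100.0 * x ** -2

    # fit a straight line in log-log space: log(y) = a log(x) + b
    a, b = np.polyfit(np.log(x), np.log(y), 1)
    c = np.exp(b)
    print(a, c)  # ~ -2.0 and ~100.0: exponent and constant recovered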

What is this distribution?

Solution: geometric distribution,
y = (1-p)^{x-1} p
(with p = 1/2 in this case). In standard exponential form,
y = (1-p)^x \frac{p}{1-p} = e^{x \log(1-p)} \frac{p}{1-p} = c e^{ax},
with a = log(1 − p) and c = p/(1 − p). Examples: random network models (degree is geometrically distributed); the distribution of word lengths in random typing (empty words are not allowed) [Miller, 1957]; the distribution of projection lengths in real neural networks [Ercsey-Ravasz et al., 2013].

A power-law distribution What is the exponent of the power-law?

Solution: zeta distribution,
y = \frac{x^{-a}}{\zeta(a)},
with a = 2. A closed formula for ζ(a) is known for certain integer values, e.g., ζ(2) = π²/6 ≈ 1.645. Examples: the empirical degree distribution of global syntactic dependency networks [Ferrer-i Cancho et al., 2004] (but see also the lab session on degree distributions); the frequency spectrum of words in texts [Corral et al., 2015].

What is this distribution?

Solution: a logarithmic distribution,
y = c (\log(x_{max}) - \log(x)),
with x = 1, 2, ..., x_{max} and c being a normalization term, i.e.,
c = \left( \sum_{x=1}^{x_{max}} (\log(x_{max}) - \log(x)) \right)^{-1}.

The problems of visual fitting. The right transformation to show linearity might not be obvious (taking logs is just one possibility). Looks can be deceiving with noisy data. A good guess or strong support for the hypothesis requires several decades (orders of magnitude) of data. Solution: a quantitative approach.

Regression I [Ritz and Streibig, 2008]
A univariate response y. A predictor variable x. Goal: the functional dependency between y and x. Formally: y = f(x, β), where f(x, β) is the model, with K parameters β = (β_1, ..., β_K). Examples:
Linear model: f(x, (a, b)) = ax + b (K = 2).
A non-linear model (power law): f(x, (a, b)) = a x^b (K = 2).

Regression II
The problem of regression: a data set of n pairs (x_1, y_1), ..., (x_n, y_n). Example: x_i is vertex degree (k) and y_i is the number of vertices of degree k (n(k)) in a real network. n is the sample size. f(x, β) is unlikely to give a perfect fit: y_1, y_2, ..., y_n may contain error. Solution: the conditional mean response,
E(y_i | x_i) = f(x_i, \beta)
(f(x, β) is not actually a model for the data points but a model for the expectation of y given x_i).

Regression III
The full model is then
y_i = E(y_i | x_i) + \epsilon_i = f(x_i, \beta) + \epsilon_i.
The quality of the fit of a model with certain parameters: the residual sum of squares,
RSS(\beta) = \sum_{i=1}^{n} (y_i - f(x_i, \beta))^2.
The parameters of the model are estimated by minimizing the RSS (least squares). A common metric of the quality of the fit: the residual standard error s, with
s^2 = \frac{RSS(\hat{\beta})}{n - K}.
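A sketch of this procedure on synthetic noisy power-law data (the model, data and starting values below are illustrative assumptions); scipy.optimize.curve_fit performs the RSS minimization:

    import numpy as np
    from scipy.optimize import curve_fit

    def f(x, a, b):
        # non-linear power-law model f(x, (a, b)) = a x^b, K = 2
        return a * x ** b

    rng = np.random.default_rng(0)
    x = np.arange(1.0, 51.0)
    y = 200.0 * x ** -1.5 + rng.normal(0.0, 1.0, x.size)  # noisy responses

    beta, _ = curve_fit(f, x, y, p0=(1.0, -1.0))  # least-squares estimate of (a, b)
    rss = np.sum((y - f(x, *beta)) ** 2)          # RSS(beta-hat)
    s2 = rss / (x.size - 2)                       # s^2 = RSS / (n - K)
    print(beta, rss, s2)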

Example of non-linear regression: the fit yields y = 2273.8 x^{-1.23} (is the exponent that low?). Is the method robust? (= not distracted by undersampling, noise, and so on). Likely and unlikely events are weighted equally. Solution: weighted regression, taking likelihood into account, ...

Likelihood I [Burnham and Anderson, 2003]
A probabilistic metric of the quality of the fit. L(parameters | data, model): the likelihood of the parameters given the data (a sample of size n) and a model. Example: L(γ | data, zeta distribution with parameter γ). Best parameters: the parameters that maximize L(parameters | data, model).

Likelihood II
Consider a sample x_1, x_2, ..., x_n (e.g., the degree sequence of a network). Definition (assuming independence):
L(parameters | data, model) = \prod_{i=1}^{n} p(x_i; parameters).
For a zeta distribution,
L(\gamma | x_1, x_2, ..., x_n; \text{zeta distribution}) = \prod_{i=1}^{n} p(x_i; \gamma) = \zeta(\gamma)^{-n} \prod_{i=1}^{n} x_i^{-\gamma}.

Log-likelihood
L(parameters | data, model) is a vanishingly small number. Solution: taking logs,
\mathcal{L}(parameters | data, model) = \log L(parameters | data, model) = \sum_{i=1}^{n} \log p(x_i; parameters).
Example:
\mathcal{L}(\gamma | x_1, x_2, ..., x_n; \text{zeta distribution}) = \sum_{i=1}^{n} \log p(x_i; \gamma) = -\gamma \sum_{i=1}^{n} \log x_i - n \log \zeta(\gamma).
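A sketch of maximum likelihood estimation of γ for the zeta distribution, maximizing this log-likelihood numerically (the sample is an illustrative toy degree sequence):

    import numpy as np
    from scipy.special import zeta          # Riemann zeta for a single argument
    from scipy.optimize import minimize_scalar

    def minus_logL(gamma, x):
        # -logL(gamma | x; zeta) = gamma sum(log x_i) + n log(zeta(gamma))
        return gamma * np.sum(np.log(x)) + x.size * np.log(zeta(gamma))

    x = np.array([1, 1, 1, 2, 2, 3, 5, 8, 13])  # toy degree sequence
    res = minimize_scalar(minus_logL, bounds=(1.0001, 10.0),
                          args=(x,), method="bounded")
    print(res.x)  # maximum likelihood estimate of gamma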

Question to the audience. What is the best model for the data? Clue: a universal method.

What is the best model for the data? The best model of the data is the data itself. Overfitting! The quality of the fit cannot decrease if more parameters are added (wisely). Indeed, the quality of the fit normally increases when adding parameters. The metaphor of picture compression: when compressing a picture (with quality reduction), a good compression technique shows a nice trade-off between file size and image quality. Modelling is compressing a sample, the empirical distribution (e.g., compressing the degree sequence of a network). Models with many parameters should be penalized! Models compressing the data with low quality should also be penalized. How?

Akaike's information criterion (AIC):
AIC = -2\mathcal{L} + 2K,
with K being the number of parameters of the model. For small samples, a correction is necessary:
AIC_c = -2\mathcal{L} + 2K \left( \frac{n}{n - K - 1} \right),
or equivalently
AIC_c = -2\mathcal{L} + 2K + \frac{2K(K+1)}{n - K - 1} = AIC + \frac{2K(K+1)}{n - K - 1}.
AIC_c is recommended if n ≫ K is not satisfied!
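These formulas translate directly into code (a minimal sketch; logL, K and n are whatever your model and sample provide):

    def aic(logL, K):
        # AIC = -2 logL + 2 K
        return -2.0 * logL + 2.0 * K

    def aicc(logL, K, n):
        # AICc = AIC + 2K(K+1)/(n - K - 1), the small-sample correction
        return aic(logL, K) + 2.0 * K * (K + 1) / (n - K - 1)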

Model selection with AIC. Which is the best of a set of models? The model that minimizes AIC. AIC_best: the AIC of the model with the smallest AIC. Δ = AIC − AIC_best: the AIC difference, the difference between the AIC of a model and that of the best model (Δ = 0 for the best model).

Example of model selection with AIC. Consider the case of model selection with three nested models:
Model 1: p(k) = k^{-2}/\zeta(2) (zeta distribution with exponent 2)
Model 2: p(k) = k^{-\gamma}/\zeta(\gamma) (zeta distribution)
Model 3: p(k) = k^{-\gamma}/H(k_{max}, \gamma) (right-truncated zeta distribution)
Model i nests model i − 1 if model i is a generalization of model i − 1 (adding at least one parameter).

Example of model selection with AIC.

Model   K   𝓛     AIC   Δ
1       0   ...   ...   ...
2       1   ...   ...   ...
3       2   ...   ...   ...

Imagine that the true model is a zeta distribution with γ = 1.5 and the sample is large enough; then:

Model   K   𝓛     AIC   Δ
1       0   ...   ...   ≫ 0
2       1   ...   ...   0
3       2   ...   ...   > 0
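A sketch of how this comparison could be run on a sample (toy data; as an assumption, model 3 counts both γ and k_max as parameters, with k_max set to the sample maximum):

    import numpy as np
    from scipy.special import zeta
    from scipy.optimize import minimize_scalar

    def logL_zeta(g, x):
        # log-likelihood of the zeta distribution
        return -g * np.sum(np.log(x)) - x.size * np.log(zeta(g))

    def logL_trunc(g, x, kmax):
        # log-likelihood of the right-truncated zeta distribution
        H = np.sum(np.arange(1.0, kmax + 1.0) ** -g)  # H(kmax, g)
        return -g * np.sum(np.log(x)) - x.size * np.log(H)

    x = np.array([1, 1, 1, 1, 2, 2, 3, 4, 7, 11])  # toy degree sequence
    n, kmax = x.size, x.max()

    L1 = logL_zeta(2.0, x)  # model 1: gamma fixed at 2, K = 0
    r2 = minimize_scalar(lambda g: -logL_zeta(g, x),
                         bounds=(1.0001, 10.0), method="bounded")  # K = 1
    r3 = minimize_scalar(lambda g: -logL_trunc(g, x, kmax),
                         bounds=(0.0001, 10.0), method="bounded")  # K = 2

    for K, L in [(0, L1), (1, -r2.fun), (2, -r3.fun)]:
        print(K, -2 * L + 2 * K + 2 * K * (K + 1) / (n - K - 1))  # AICc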

AIC for non-linear regression I
RSS: the distance between the data and the fitted regression curve based on the model fit. AIC: an estimate of the distance from the model fit to the true but unknown model that generated the data. In a regression model one assumes that the error ε follows a normal distribution, whose p.d.f. is
f(\epsilon) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(\epsilon - \mu)^2}{2\sigma^2} \right\}.
The only parameter is σ, as standard non-linear regression assumes µ = 0.

AIC for non-linear regression II
Applying µ = 0 and ε_i = y_i − f(x_i, β),
f(\epsilon_i) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(y_i - f(x_i, \beta))^2}{2\sigma^2} \right\},
in a regression model
L(\beta, \sigma^2) = \prod_{i=1}^{n} f(\epsilon_i).
After some algebra one gets
L(\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{RSS(\beta)}{2\sigma^2} \right\}.

AIC for non-linear regression III
Equivalence between maximization of the likelihood and minimization of the error (under certain assumptions): if β̂ is the best estimate of β, then
L(\hat{\beta}, \hat{\sigma}^2) = \frac{\exp(-n/2)}{(2\pi RSS(\hat{\beta})/n)^{n/2}},
with σ̂² = ((n − K)/n) s² (recall s² = RSS(β̂)/(n − K)). Model selection with regression models:
AIC = -2 \log L(\hat{\beta}, \hat{\sigma}^2) + 2(K + 1) = n \log(2\pi) + n \log(RSS(\hat{\beta})/n) + n + 2(K + 1).
Why is the term for parsimony 2(K + 1) and not 2K?
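Continuing the regression sketch above, AIC follows directly from the RSS (a hedged sketch; rss, n and K come from the earlier curve_fit example):

    import numpy as np

    def aic_regression(rss, n, K):
        # AIC = n log(2 pi) + n log(RSS/n) + n + 2 (K + 1);
        # sigma counts as an extra estimated parameter, hence K + 1
        return n * np.log(2 * np.pi) + n * np.log(rss / n) + n + 2 * (K + 1)

    # e.g., with the power-law fit above: aic_regression(rss, x.size, 2)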

Concluding remarks. Under non-linear regression, AIC is the way to go for model selection if the models are not nested (alternative methods exist for nested models [Ritz and Streibig, 2008]). The equivalence between maximum likelihood and non-linear regression implies some assumptions (e.g., homoscedasticity).

References I
Burnham, K. P. and Anderson, D. R. (2003). Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.
Corral, A., Boleda, G., and Ferrer-i Cancho, R. (2015). Zipf's law for word frequencies: word forms versus lemmas in long texts. PLoS ONE, 10(7):e0129031.

References II
Ercsey-Ravasz, M., Markov, N. T., Lamy, C., Van Essen, D. C., Knoblauch, K., Toroczkai, Z., and Kennedy, H. (2013). A predictive network model of cerebral cortical connectivity based on a distance rule. Neuron, 80(1):184-197.
Ferrer-i Cancho, R., Solé, R. V., and Köhler, R. (2004). Patterns in syntactic dependency networks. Physical Review E, 69(5):051915.
Miller, G. A. (1957). Some effects of intermittent silence. The American Journal of Psychology, 70(2):311-314.

References III
Ritz, C. and Streibig, J. C. (2008). Nonlinear regression with R. Springer Science & Business Media.