KU Leuven, ECOOM & Dept MSI, Leuven (Belgium)

2 Structure
Basic postulates in bibliometrics
On probability distributions and stochastic processes
  A model for scientific productivity
  A model for citation processes
Probability distributions and stochastic processes
  Asymptotic normality of scientometric indicators
  The h-index
  The tail parameter

3 Foreword
The mathematical context of scientometrics/bibliometrics
In this session we focus on the mathematical foundations of, and models for, fundamental bibliometric processes such as publication and citation processes. Several deterministic and stochastic models are introduced and discussed in the context of building and interpreting indicators as (statistical) functions derived from the model. Special attention is paid to the high end of performance, represented by the heavy tails of the underlying distributions.
Communication networks and their graph and vector-space representations, which form the mathematical groundwork for structural analysis, require more time and space and are therefore omitted here. Part of that theory is presented in the following lecture.

5 Introduction
The mathematical context of scientometrics/bibliometrics
Whenever a discipline reaches a stage that requires the support of statistical methods, a "metrics" emerges from that field. Classical examples are biometrics, psychometrics and econometrics. In this context, P ( ) explained the term bibliometrics as the application of mathematical and statistical methods to books and other media of communication.

6 Introduction
The application of mathematical models, notably of stochastic models, has the following important advantages.
1. It provides mathematical interpretations beside the scientometric ones (even beyond the particular field). Mathematical interpretation of scientometric measures can be given by parameters and statistical functions.
2. Although deterministic models also allow randomness, the use of probabilistic models opens new perspectives, above all concerning inference.

7 Introduction
The deterministic approach is the easiest way to turn simple counts of raw data and measurements into indicators. Mostly elementary mathematical operations (e.g., shares, averages, ratios) are used. The interpretation of more complex measures and constructs (composite indicators) becomes more problematic.
In social processes, observations are subject to a variety of influences, some of which are not directly measurable. The complexity of social interactions itself yields the effect that is usually interpreted as randomness. In bibliometrics, events appear random because they are conditioned by a plethora of superposed actions, processes and effects. Communication, mobility, collaboration, publications and citations are all subject to these effects.
G M, Thoughts and facts on bibliometric indicators,

8 Basic postulates in bibliometrics
Basic postulates help to define bibliometric measures properly.
1. A paper receives at most one citation at a time.
2. An author publishes only one article at a time.
3. A paper does not receive citations prior to its publication.
4. The citation link between two papers is unique.
The first two postulates are necessary to define mappings and allow the application of point processes. Simultaneous publications or citations, e.g. in the same issue of a journal at time t, can be handled by using an infinitesimal number ε > 0 and assigning the times t, t + ε, t + 2ε, etc. The fourth postulate is straightforward, and the third is necessary to make the quantification of citation impact possible.

9 Basic postulates in bibliometrics
[Figure: The publication process from the bibliometric viewpoint (at time t and in the period T = [s, t] for some time s < t). Two panels show the mappings F_t and F_T between authors (co-authors) and papers, with time points t_1, t_2, t_3.]

10 Basic postulates in bibliometrics
[Figure: The citation process from the bibliometric viewpoint (at time t and in the period T = [s, t] for some time s < t). Two panels show the mappings Y_t and Y_T between source papers (references) and citing papers, with time points t_1, t_2, t_3.]

11 Basic postulates in bibliometrics
In the first case, the mapping $F_t$ (for each time t being the publication date) describes an authorship. The period T = [s, t] is called the publication period. $F_T$ then defines the publication process of an author. The complete origin $F_t^{-1}(b)$ of a given publication is the set of its co-authors: if we denote the set of authors by A and that of papers by B, then we have for each element $b \in B$:
$$F_t^{-1}(b) = \{a \in A : F_t(a) = b\}.$$
In the second case, the mapping $Y_t$ (for each time t being the publication date of the citing paper) describes a citation. The period T = [s, t] is called the citation window. $Y_T$ then forms the citation process of a paper. The complete origin $Y_t^{-1}(b)$ of a given publication is the set of its references: if we denote the set of source papers by A and that of citing papers by B, then we have for each element $b \in B$:
$$Y_t^{-1}(b) = \{a \in A : Y_t(a) = b\}.$$
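
The complete origin of a paper is simply the preimage of that paper under the mapping. A minimal sketch, assuming the mapping at a fixed time t is stored as a plain dictionary (the names `complete_origins`, `F_t` and the author/paper labels are hypothetical and only serve the illustration):

```python
from collections import defaultdict

def complete_origins(mapping):
    """Return {b: set of all a with mapping[a] == b}, i.e. the preimage of each b."""
    origins = defaultdict(set)
    for a, b in mapping.items():
        origins[b].add(a)
    return dict(origins)

# Hypothetical authorship at one time t: each author maps to exactly one paper
# (second postulate), so a dict is a valid representation of the mapping.
F_t = {"author_1": "paper_A", "author_2": "paper_A", "author_3": "paper_B"}
origins = complete_origins(F_t)
print(origins["paper_A"])  # the co-authors of paper_A
```

The same function applies to a citation mapping, where the keys are source papers and the values are citing papers.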

12 Models for the dissemination of scientific information
Probability distributions
A random variable (X) is a function defined on a sample space. Its probability distribution describes the probability with which the r.v. X takes its values x, e.g. through $P(X \le x) \in [0, 1]$.
Two basic types of distributions:
Discrete distributions (taking values in a discrete set, e.g. non-negative integer values)
Continuous distributions (taking values in a continuous set, e.g. real values)
Probability distributions are used to model, for instance, the publication activity of an author, the citation impact of a paper or the number of co-authors of a paper.

14 Probability distributions
[Figure: Example of a discrete distribution (left) and the density function of a continuous distribution (right). Left panel: Poisson distribution, $P(X = k) = e^{-\lambda}\lambda^k/k!$ with $\lambda = 2$, plotted against k. Right panel: standard normal distribution, plotted against x.]
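
The Poisson probabilities shown in the left panel can be recomputed directly from the formula. This is only a small sketch; the function name `poisson_pmf` and the cut-off at k = 10 are choices made for the illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam**k / k! for a Poisson distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Probabilities for k = 0, ..., 10 with lambda = 2, as in the figure.
pmf = [poisson_pmf(k, 2.0) for k in range(11)]
for k, p in enumerate(pmf):
    print(k, round(p, 4))
```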

15 Probability in bibliometrics
Bibliometric distributions are typically skewed and have long tails. Furthermore, most distributions (publication activity, citation rates, co-authorship) are integer-valued. Below, an example of a skewed, long-tailed and integer-valued distribution is given.
[Figure: Empirical distribution of the citations received by a scientist's work. Axes: k (last class: > 20) and $f_k$.]

16 Stochastic processes
In probability theory, stochastic processes are families (i.e., collections) of random variables $\{X_t : t \in T\}$ defined on a given probability space and indexed by a totally ordered set T. The latter set T, which is considered to represent time, may be discrete or continuous.
A distinction has to be made between a stochastic process and a time series. While a stochastic process is a well-structured mapping describing evolution, a time series is a sequence of data measured at different time instants.
Examples:
Stochastic process: the cumulative number of papers an author has published in his/her career; the cumulative number of citations a paper has received since its publication.
Time series: the annual impact factor of a journal.

19 Models for the dissemination of scientific information
A model for scientific productivity
In order to facilitate comprehensibility and interpretation, the publication process is first introduced using a deterministic model. Three groups of individuals are assumed.
1. Those who are entering the system,
2. Those who are staying in the system and
3. Those who are leaving it.
The system is described as follows.
S G, Scientometrics,

20 Waring process
Consider an infinite array of units indexed in succession by the non-negative integers. The content of the i-th unit is denoted by $x_i$, the (finite) content of all units by x. The ratio $y_i = x_i/x$ $(i \ge 0)$ expresses the share of elements contained in the i-th unit. The change of content is postulated to obey the following rules.
(i) Substance may enter the system from the external environment through the 0-th unit at a rate s.
(ii) Substance may be transferred unidirectionally from the i-th unit to the (i + 1)-th one at a rate $f_i$ $(i \ge 0)$.
(iii) Substance may leak out from the i-th unit into the external environment at a rate $g_i$ $(i \ge 0)$.

21 Waring process
[Figure: Scheme of substance flow in the Waring process]
We interpret the above ratios $y_i$ as the (classical) probability with which an element is contained in the i-th unit. The stochastic process is then formed by the change of the content of the units, i.e., by the change in the number of papers published by the authors who have entered the system. X(t) denotes the (random) number of published papers, and $P(X(t) = i) = y_i$ is the probability that an author in the system has published i papers in the period (0, t).

23 Waring process
Using the above notation, we can give a mathematical formulation for the equations of change in the system:
$$x_0'(t) = s(t) - f_0(t) - g_0(t)$$
$$x_i'(t) = f_{i-1}(t) - f_i(t) - g_i(t), \quad (i > 0)$$
The following particular forms of the above rate terms are used:
$$s = \sigma x, \qquad f_i = (a + b\,i)\,x_i \quad (i \ge 0), \qquad g_i = \gamma\,x_i \quad (i \ge 0)$$
where σ, a, b and γ are non-negative real values.

24 Waring process
The equations of change, together with the particular rate terms, result in
$$x'(t) = \sum_i x_i'(t) = s - \sum_i g_i = (\sigma - \gamma)\,x$$
$$y_0'(t) = \sigma - (a + \sigma)\,y_0$$
$$y_i'(t) = (a + b(i-1))\,y_{i-1} - (a + b\,i + \sigma)\,y_i, \quad (i > 0)$$
with the initial conditions
$$y_i(0) = \begin{cases} 1 & \text{if } i = 0, \\ 0 & \text{otherwise.} \end{cases}$$
We obtain $x(t) = x(0)\,e^{(\sigma - \gamma)t}$, i.e., the system is asymptotically time-invariant (stationary) only if σ = γ; otherwise, if σ > γ or σ < γ, it grows or decays exponentially, respectively.

25 Waring process
The general solution of the above system of differential equations, which is independent of γ, reads
$$y_i(t) = \sum_{j=0}^{i} b_{ij}\,e^{-(a + b\,j + \sigma)t} + \frac{\sigma}{a + \sigma}\prod_{j=1}^{i}\frac{a + b(j-1)}{a + b\,j + \sigma}.$$
As t tends to infinity, the first term (the sum of exponentials) vanishes, and the second term remains: a Waring distribution with parameters N = a/b and α = σ/b.
The following three special cases should be mentioned.
1. α = 1: Price distribution
2. N = 1: Yule distribution
3. b = 0: Geometric distribution
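
The limiting distribution can be checked numerically against the differential equations for the shares $y_i$. The sketch below builds the stationary shares from $y_0 = \sigma/(a+\sigma)$ and the product formula, then evaluates the right-hand sides $\sigma - (a+\sigma)y_0$ and $(a + b(i-1))y_{i-1} - (a + bi + \sigma)y_i$, which should all vanish. The function names and the parameter values are hypothetical choices for illustration only.

```python
def waring_stationary(a, b, sigma, kmax):
    """Stationary shares y_0..y_kmax: y_0 = sigma/(a+sigma), then the product formula."""
    y = [sigma / (a + sigma)]
    for i in range(1, kmax + 1):
        y.append(y[-1] * (a + b * (i - 1)) / (a + b * i + sigma))
    return y

def rhs(y, a, b, sigma):
    """Right-hand sides of the equations for y_0' and y_i' (i > 0)."""
    out = [sigma - (a + sigma) * y[0]]
    for i in range(1, len(y)):
        out.append((a + b * (i - 1)) * y[i - 1] - (a + b * i + sigma) * y[i])
    return out

# Hypothetical parameters: a = 1, b = 0.5, sigma = 1, i.e. N = a/b = 2 and alpha = sigma/b = 2.
y = waring_stationary(a=1.0, b=0.5, sigma=1.0, kmax=2000)
residuals = rhs(y, a=1.0, b=0.5, sigma=1.0)

# Special case b = 0: the shares reduce to a geometric distribution.
geo = waring_stationary(a=1.0, b=0.0, sigma=1.0, kmax=50)
print(max(abs(r) for r in residuals), sum(y), geo[:4])
```

Truncating at `kmax` leaves a small part of the heavy tail uncounted, so the shares sum to slightly less than one.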

28 Waring process
Derek de Solla Price distinguished the following categories of authors: newcomers, continuants, transients and terminators.
P G, International Forum on Information and Documentation,
Within the framework of this model, s represents the group of newcomers, $g_i$ the terminators, $g_0$ the transients and $\sum f_i$ the group of continuants.
Further developments and applications: P, JASIS, X, Utilitas Mathematica, X, Journal of the Royal Statistical Society, Series A, G S, Scientometrics, B., Drug Metabolism Reviews, S., Scientometrics, B, Scientometrics,

29-30 Waring process
Example: Distribution of authors by publication productivity (table continued over two slides)
[Table: for each sample, the number of authors with k papers, observed (Obs) and calculated (Calc); summary rows n, p, KS, N and $\hat\alpha$.]
Legend: k: number of papers, n: total number of authors, p: total number of papers, KS: K-S statistics
Data source: S G ( ); P J ( )

31 Waring process
[Figure: Population growth (T d. a) of authors according to the example by S G ( )]

32 The Price distribution
De Solla Price's square root law states that half of the scientific papers are contributed by the top square root of the total number of scientific authors. Versions of the Waring, Zipf and Lotka distributions have been studied with regard to whether they satisfy this law.
G S, Scientometrics, E R, Journal of Information Science, E, Scientometrics,
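
Whether a given productivity distribution is close to the square root law can be checked by computing the share of papers contributed by the top $\sqrt{n}$ authors. The sketch below is an illustration only: the function name, the rounding of $\sqrt{n}$ to the nearest integer and the toy data are assumptions, not part of the original lecture.

```python
import math

def top_sqrt_share(papers_per_author):
    """Share of all papers contributed by the top sqrt(n) authors (n = number of authors)."""
    counts = sorted(papers_per_author, reverse=True)
    top = round(math.sqrt(len(counts)))  # assumption: round sqrt(n) to the nearest integer
    return sum(counts[:top]) / sum(counts)

# Hypothetical toy data: 16 authors, so the top sqrt(16) = 4 authors are inspected.
counts = [20, 10, 6, 4] + [3] * 4 + [2] * 4 + [1] * 4
share = top_sqrt_share(counts)
print(share)  # the law would predict a value near 0.5
```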

33 The Price distribution
[Table: Fit of the Price (N =. ) and Lotka (α = ) distributions to the original Lotka data. Columns: k; observed frequency; calculated frequency (Lotka, Price). The last row is an open-ended class.]

34 A Negative Binomial Process
Consider now the same array of units as in the case of publication activity. We now assume that the system is isolated from external influences, i.e., no substance enters or leaves the system. Thus σ = γ = 0, and only rule (ii) of the preceding paragraph remains valid and takes the following form:
(ii') $f_i = (a + b\,i)\,x_i$; $a(t)/b(t) = N = \text{const} > 0$
Hence x(t) = x(0) follows, where x(0) > 0 is assumed. The special assumption x(0) = 1 does not mean any restriction of generality. The system is described as follows.
G S, Scientometrics, G S, Information Processing & Management,

35 A Negative Binomial Process
[Figure: Scheme of substance flow in the non-homogeneous birth process]
Let X(t) denote the number of citations received and $P(X(t) = k) = y_k$ the probability that a paper has been cited k times up to time t. With the above transition rules, we can write
$$y_i'(t) = \{(N + i - 1)\,y_{i-1} - (N + i)\,y_i\}\,b(t), \quad (i > 0)$$
with the same initial conditions as in the case of the Waring process.

37 A Negative Binomial Process
Hence we obtain by successive integration the following solutions:
$$P(X(t) = k) = y_k(t) = \binom{N + k - 1}{k}\,e^{-r(t)N}\,\bigl(1 - e^{-r(t)}\bigr)^k \quad (k \ge 0); \qquad r(t) = \int_0^t b(u)\,du$$
The probability that at time t the substance is in the k-th unit, provided it was in the i-th one at time $s \le t$, is denoted by $p_{ik}(s, t)$ $(k \ge i)$. Now the initial conditions are
$$p_{ik}(s, s) = \begin{cases} 1 & \text{if } k = i, \\ 0 & \text{otherwise.} \end{cases}$$

39 A Negative Binomial Process
The second system of differential equations
$$\partial p_{ik}(s, t)/\partial t = \{(N + k - 1)\,p_{i,k-1}(s, t) - (N + k)\,p_{ik}(s, t)\}\,b(t); \quad k > i$$
results in the following solution:
$$P(X(t) - X(s) = j \mid X(s) = i) = \binom{N + i + j - 1}{j}\,e^{-(r(t) - r(s))(N + i)}\,\bigl(1 - e^{-(r(t) - r(s))}\bigr)^j, \quad j \ge 0$$
The following two special cases are worth mentioning.
1. Geometric distribution (N = 1)
2. Poisson distribution ($N \to \infty$, $q \to 1$; $N(q - 1) \to c < \infty$)

41 A Negative Binomial Process
The mean value function of the process is defined as the regression function of X(t) − X(s) on X(s) = i:
$$M_i(s, t) = E(X(t) - X(s) \mid X(s) = i) = (N + i)\bigl(e^{r(t) - r(s)} - 1\bigr); \quad i \ge 0,\ t \ge s$$
This function will play an important role in the applications. Under the above conditions we have
$$M(s, t) = E(X(t) - X(s)) = N\bigl(e^{r(t)} - e^{r(s)}\bigr)$$
The latter equation reflects the non-homogeneity of the process, i.e., for example, $M(s, s + h) \ne M(t, t + h)$ if $s \ne t$ (h > 0). Non-homogeneity is an important property of citation processes.
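
The negative binomial probabilities and the mean value function can be cross-checked numerically. The sketch below assumes the form $P(X(t) = k) = \binom{N+k-1}{k} e^{-r(t)N}(1 - e^{-r(t)})^k$ and the mean $(N + i)(e^{r(t) - r(s)} - 1)$; the function names and the parameter values N = 2.5, r = 0.8 are hypothetical. The binomial coefficient is evaluated with the log-gamma function so that non-integer N is allowed.

```python
import math

def nb_pmf(k, N, r):
    """P(X(t) = k) for r = r(t): binom(N+k-1, k) * exp(-r*N) * (1 - exp(-r))**k."""
    log_binom = math.lgamma(N + k) - math.lgamma(k + 1) - math.lgamma(N)
    return math.exp(log_binom - r * N + k * math.log1p(-math.exp(-r)))

def mean_increment(N, i, r_s, r_t):
    """M_i(s, t) = (N + i) * (exp(r(t) - r(s)) - 1)."""
    return (N + i) * (math.exp(r_t - r_s) - 1.0)

N, r = 2.5, 0.8  # hypothetical parameter values
pmf = [nb_pmf(k, N, r) for k in range(400)]
mean_from_pmf = sum(k * p for k, p in enumerate(pmf))
print(sum(pmf), mean_from_pmf, mean_increment(N, 0, 0.0, r))
```

Starting from i = 0 at r(s) = 0, the mean computed from the probabilities should agree with the closed form $N(e^{r(t)} - 1)$.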

42 A Negative Binomial Process
[Figure: Two examples of different ageing types of scientific literature: (a) fast and (b) extremely slow maturing and decline. Horizontal axis: year after publication (x). Panel labels: E(X) = 4 and E(X) = 9.]

43 Stopping times of citation processes
So far we have considered the change of the number of citations over time. In order to provide additional information about the succession of citations a paper receives during a certain time span $[0, t_n]$, we develop an appropriate model for the reception speed. Let $T_i$ denote the shortest period during which the paper has received i citations:
$$T_i = \min\{t_n : X(t_n) \ge i\}.$$
Random variables of this type are called stopping times. Three important properties can be derived from the definition:
(i) $P(T_i = t_n) = P(X(t_n) \ge i) - P(X(t_{n-1}) \ge i)$
(ii) $P(T_i = \infty) \ge 0$
(iii) $P(T_i = t_0) = P(X(t_0) \ge i)$

44 Stopping times of citation processes
Some conclusions from these properties:
(ii) $P(T_i = \infty) > 0$ if the paper does not receive i citations; specifically, $P(T_1 = \infty) > 0$ if the paper is never cited.
(iii) $P(T_i = t_0) = 1$ if i = 0.
Some further developments and applications: G, Information Processing & Management, R, Scientometrics, B, Scientometrics, E, Mathematical and Computer Modelling,
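
For a single observed paper, the stopping time is the first observation point at which the cumulative citation count reaches i. A minimal sketch, assuming the counts are stored as a list indexed by observation point (the function name and the data are hypothetical); a paper that never reaches i citations gets an infinite stopping time.

```python
import math

def stopping_time(cumulative, i):
    """Smallest index n with cumulative[n] >= i; math.inf if i citations are never reached."""
    for n, x in enumerate(cumulative):
        if x >= i:
            return n
    return math.inf

# Hypothetical cumulative citation counts X(t_0), X(t_1), ... of one paper,
# where t_0 is the publication date.
X = [0, 0, 2, 3, 3, 7]
for i in (0, 1, 3, 5, 8):
    print(i, stopping_time(X, i))
```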

45 Probability and statistics
Pólya's urn model
A general model for probability distributions and stochastic processes with implications for scientometrics
Assume an urn that contains w white and b black balls, where w and b are positive integers. Balls are drawn at random and then returned together with s balls of the same colour, where s is an integer. If s is negative, balls of the same colour are removed (in the case of s = −1 the drawn ball is not placed back), while s = 0 means that no additional balls are placed back.
Two basic situations are obtained, resulting in two specific random variables.
1. The number k of white balls after n draws.
2. The number k of draws until a given number n of black balls is drawn.

46 Pólya's urn model
The first type of distributions is always finite. Examples:
s = 0: binomial
s < 0: hypergeometric
Examples for the second type:
s = 0: negative binomial (k = 1: geometric)
s > 0: inverse Pólya-Eggenberger (k = 1: Waring)
The urn model can also be used to model some stochastic processes.
J K, Distributions in Statistics: Discrete distributions, J K, Urn models and their application,
Advantage of the model: the self-reinforcing property reflects the cumulative advantage or success-breeds-success phenomenon.
P, JASIS,
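
The first urn situation (number of white balls after n draws) is easy to simulate. The sketch below is an illustration under stated assumptions: the function name, the seed and the parameter values are hypothetical, and the urn contents are clamped at zero for negative s.

```python
import random

def polya_white_count(w, b, s, n, rng):
    """Number of white balls in n draws; each drawn ball is returned with s more of its colour."""
    white_drawn = 0
    for _ in range(n):
        if w + b <= 0:
            raise ValueError("urn is empty")
        if rng.random() < w / (w + b):
            white_drawn += 1
            w = max(w + s, 0)  # negative s removes balls of the drawn colour
        else:
            b = max(b + s, 0)
    return white_drawn

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
# s = 1: self-reinforcing case (success breeds success), 1000 repetitions of 50 draws.
counts = [polya_white_count(5, 5, 1, 50, rng) for _ in range(1000)]
print(min(counts), max(counts))
```

With s = −1 (no replacement) and as many draws as there are balls, every white ball is eventually drawn, which gives a simple sanity check of the bookkeeping.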

47 Probability distributions
[Figure: Examples of differently shaped probability distributions obtained from the urn model (left: no tail, right: heavy tail). Left panel: geometric and negative binomial distributions. Right panel: Waring and inverse Pólya-Eggenberger distributions. Axes: k and $f_k$.]

48 Asymptotic normality of scientometric indicators
Authorship, publication activities, references, citations and other links can be expressed by random variables. Nevertheless, applying most mathematical-statistical methods to scientometrics yields only approximate solutions.
1. One reason is ambiguity and uncertainty (cf. Bookstein, ).
2. Another reason: normality is the basis and the condition of many statistical tests. Most scientometric distributions are discrete and extremely skewed, i.e., not normal, but most indicators are approximately normally distributed.

52 Asymptotic normality of scientometric indicators
Theorem (Central limit theorem)
Let $X_1, X_2, \dots, X_n$ be a sequence of n independent and identically distributed random variables with finite expectation μ and variance $\sigma^2 > 0$. Then the distribution of the random variable
$$Z_n = \Bigl(\sum_{i=1}^{n} X_i - n\mu\Bigr) \Big/ \bigl(\sigma\sqrt{n}\bigr)$$
converges weakly to the standard normal distribution N(0, 1) as n tends to infinity.
Remark
Under certain conditions (e.g., the Lindeberg or Lyapunov condition), a weaker form of the central limit theorem holds in which identical distribution is not required.

54 Central limit theorem
Remark
As a consequence, the sample mean $\bar{x} = \sum X_i / n$ of random variables with any distribution belonging to the attraction domain of the normal distribution is approximately normally distributed. We have $E(\bar{x}) = \mu$ and $D(\bar{x}) = \sigma/\sqrt{n}$, where μ and σ are the expectation and standard deviation of the common distribution.
According to Glivenko's theorem, the empirical distribution converges to the underlying theoretical one with probability 1.
The relative frequency f is an unbiased and consistent estimator of the corresponding probability p. In particular, we have $E(f_0) = p_0$, where $p_0$ is the probability that a paper is not cited, and $D(f_0) = \sqrt{p_0(1 - p_0)/n}$.
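
In practice μ, σ and $p_0$ are unknown and are replaced by their sample estimates. The sketch below computes the sample mean with its standard error $s/\sqrt{n}$ and the share of uncited papers $f_0$ with its standard error $\sqrt{f_0(1 - f_0)/n}$. The plug-in of sample estimates, the function name and the toy data are assumptions made for this illustration.

```python
import math

def mean_and_uncited_share(citations):
    """Return (mean, se_mean, f0, se_f0) for a list of citation counts."""
    n = len(citations)
    mean = sum(citations) / n
    var = sum((c - mean) ** 2 for c in citations) / (n - 1)  # sample variance
    f0 = sum(1 for c in citations if c == 0) / n              # share of uncited papers
    return mean, math.sqrt(var / n), f0, math.sqrt(f0 * (1 - f0) / n)

# Hypothetical citation counts of ten papers.
sample = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
mean, se_mean, f0, se_f0 = mean_and_uncited_share(sample)
print(mean, se_mean, f0, se_f0)
```

With such a small and skewed sample the normal approximation is rough; it becomes more reliable as n grows.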

57 Approximate normality of means and shares
[Table: Sample means and shares of uncited papers ( % of Belgian publications in with year citation window). Recoverable columns: k, n and $f_0$ (in %), with a final "Total" row.]
G M, STI, ; Data source: Thomson Reuters Web of Knowledge

58 Approximate normality of means and shares
[Figure: Truncated moments for sample means and shares of uncited papers ( % of Belgian publications in with year citation window). Two panels with fitted lines y = 5.73x and y = 0.218x.]
G M, STI, ; Data source: Thomson Reuters Web of Knowledge
Plot based on a characterisation theorem for the normal distribution by Glänzel (, )

59 The tail of scientometric distributions
Let X be a random variable, in the present case the citation rate of a paper. The probability mass function of the non-negative integer-valued r.v. X is denoted by $p_k = P(X = k)$ for each $k \ge 0$, the distribution function by $F_k = P(X < k)$. Furthermore we put $G_k := 1 - F_k = P(X \ge k)$.
Consider now a given sample $\{X_i\}_{i=1,\dots,n}$ of size n. Assume that all elements are independent and identically distributed, with F being the common distribution. Further assume that the sample elements $X_i$ are ranked in decreasing order, $X_1 \ge X_2 \ge \dots \ge X_i \ge \dots \ge X_n$. Although this can be readily obtained from an ordinary ordered sample by replacing index i by (n − i + 1) for all i = 1, ..., n, we will use the terms "rank statistics of a statistical sample" or simply "ranked sample".

60 Ranked samples
The easiest way to calculate theoretical values for rank statistics is to use Gumbel's r-th characteristic extreme value ($u_r$).
G, Statistics of extremes,
$$u_r := G^{-1}\Bigl(\frac{r}{n}\Bigr) = \max\Bigl\{k : G_k \ge \frac{r}{n}\Bigr\},$$
where n is the size of a given sample with distribution F. $X_r$ can be considered an estimator of the corresponding r-th characteristic extreme value $u_r$.

61 The h-index
[Figure: Example of Gumbel's characteristic extreme values $u_1, \dots, u_{10}$ for a Waring distribution (α = 2). Axes: k and $n\,G_k$.]

62 The h-index
Jorge E. Hirsch introduced a new indicator, called the h-index, for the assessment of the research performance of individual scientists. A scientist has index h if h of his or her $N_p$ papers have at least h citations each and the other ($N_p$ − h) papers have at most h citations each.
H, PNAS US,
The papers meeting this criterion are also referred to as Hirsch core publications.
An alternative index was introduced by Leo Egghe. A set of papers has a g-index g if g is the highest rank such that the top g papers have, together, at least $g^2$ citations.
E, Scientometrics,
Alternative definition: g is the highest rank such that the top g papers have, on average, at least g citations.
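
Both indices can be computed from a ranked list of citation counts. The following is a minimal sketch: the function names and the toy data are hypothetical, and the g-index is capped at the number of papers, which is one of several conventions.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def g_index(citations):
    """Highest rank g such that the top g papers have together at least g**2 citations."""
    ranked = sorted(citations, reverse=True)
    g, total = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

cites = [10, 8, 5, 4, 3, 0]  # hypothetical citation counts of six papers
print(h_index(cites), g_index(cites))
```

Counting the ranks with `c >= rank` works because the counts are sorted in decreasing order, so the condition cannot become true again once it has failed.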

64 The h-index
[Table: The h- and g-index (example). Columns: rank ($R_i$), citations (Cites), mean.]

65 The h-index
In order to define the theoretical h-index we use Gumbel's r-th characteristic extreme value $u_r$:
$$u_r := G^{-1}(r/n) = \max\{k : G(k) \ge r/n\},$$
where n is the size of a given sample with distribution F, $k \ge 0$, $r \le n$ and $G = 1 - F$. The theoretical h-index (h) can be defined as
$$h := \begin{cases} \max\{r : u_r \ge r\} = \max\{r : \max\{k : G_k \ge r/n\} \ge r\}, & \text{if } n > 0 \text{ and } X_1 > 0, \\ 0, & \text{otherwise,} \end{cases}$$
whereas the empirical h-index ($\hat{h}$) can analogously be re-defined as
$$\hat{h} := \begin{cases} \max\{r : \max\{k : X_k \ge r\} \ge r\}, & \text{if } n > 0 \text{ and } X_1 > 0, \\ 0, & \text{otherwise.} \end{cases}$$
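
The theoretical h-index can be evaluated for any tail function $G(k) = P(X \ge k)$ by computing the characteristic extreme values $u_r = \max\{k : G(k) \ge r/n\}$ and taking the largest rank r with $u_r \ge r$. The sketch below assumes a non-increasing G; the function names, the search bound `kmax` and the geometric tail used as an example are hypothetical.

```python
def characteristic_extreme(G, r, n, kmax=10**6):
    """u_r = max{k : G(k) >= r/n} for a non-increasing tail function G(k) = P(X >= k)."""
    k = 0
    while k + 1 <= kmax and G(k + 1) >= r / n:
        k += 1
    return k

def theoretical_h(G, n):
    """h = max{r : u_r >= r}, or 0 if no rank r in 1..n qualifies."""
    h = 0
    for r in range(1, n + 1):
        if characteristic_extreme(G, r, n) >= r:
            h = r
    return h

# Hypothetical tail: geometric distribution with P(X >= k) = 0.5**k, sample size n = 64.
h = theoretical_h(lambda k: 0.5 ** k, n=64)
print(h)
```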

66 The asymptotic normality of the h-index
Theorem (Beirlant & Einmahl, )
Let X be the number of citations received by an author's publications. The underlying citation distribution is denoted by F. If F is continuous and $x^* = \sup\{x \in \mathbb{R} : F(x) < 1\} = \infty$, then
$$\frac{\hat{h} - h}{\sqrt{h}} + \frac{n\bigl(F(\hat{h}) - F(h)\bigr)}{\sqrt{h}} \xrightarrow{D} N(0, 1), \quad \text{for } n \to \infty.$$
B E, Asymptotics for the Hirsch Index,

67 The asymptotic normality of the h-index
Definition
The distribution of the r.v. X is called Paretian if $G(x) = x^{-\alpha}\,l(x)$, where l is a slowly varying function, i.e.
$$\lim_{x \to \infty} \frac{l(ux)}{l(x)} = 1 \quad \text{for all } u > 0.$$
Corollary (Beirlant & Einmahl, )
If F satisfies the von Mises condition, i.e. $\lim_{x \to \infty} \frac{x\,g(x)}{G(x)} = \alpha$ with $g = -G'$, and $\hat{\alpha}$ is a consistent estimator for α, then
$$\frac{(1 + \hat{\alpha})\,(\hat{h} - h)}{\sqrt{\hat{h}}} \xrightarrow{D} N(0, 1).$$
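
Asymptotic normality of this kind is typically used to attach an approximate confidence interval to an observed h-index. The sketch below assumes that the corollary can be read as $(1 + \hat{\alpha})(\hat{h} - h)/\sqrt{\hat{h}}$ being approximately standard normal, which gives the interval $\hat{h} \pm z\sqrt{\hat{h}}/(1 + \hat{\alpha})$. The function name, the default z = 1.96 and the input values are hypothetical.

```python
import math

def h_confidence_interval(h_hat, alpha_hat, z=1.96):
    """Approximate interval h_hat -/+ z * sqrt(h_hat) / (1 + alpha_hat)."""
    half_width = z * math.sqrt(h_hat) / (1.0 + alpha_hat)
    return h_hat - half_width, h_hat + half_width

# Hypothetical inputs: observed h-index 25 and estimated tail parameter 1.5.
low, high = h_confidence_interval(h_hat=25, alpha_hat=1.5)
print(round(low, 2), round(high, 2))
```

A heavier tail (smaller $\hat{\alpha}$) widens the interval for the same observed h-index.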

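68 Estimation of the tail parameter The estimation of the tail parameter $\alpha$ of Pareto-type distributions has received much attention. Assume that $\{X_i\}_{i=1}^n$ is a sample of iid r.v.s with Paretian distribution, and let $X^*_1 \ge X^*_2 \ge \dots \ge X^*_n$ denote the ranked sample. Then the ranked sample has the following property for the upper ranks $k$:

$$P\big(k \log(X^*_k / X^*_{k+1}) < x\big) \approx 1 - e^{-\alpha x}.$$

Hence Hill's estimator can be derived as the mean of the upper $k$ elements of this series:

$$H_k = \frac{1}{k} \sum_{i=1}^{k} \log X^*_i - \log X^*_{k+1}.$$

Since the elements of the series are approximately exponentially distributed with rate $\alpha$, $H_k$ estimates $1/\alpha$. $H_k$ is asymptotically normally distributed (as $k, n \to \infty$) with variance $1/(k\alpha^2)$. This makes it possible to construct confidence intervals for $\alpha$.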
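A minimal Python sketch of the Hill estimator follows. It assumes strictly positive observations (citation counts of zero would have to be handled separately before taking logarithms), and it uses the fact that the mean of the upper $k$ log-spacings is an estimate of $1/\alpha$, so that $\hat\alpha = 1/H_k$. The function names are hypothetical, and the standard error $\hat\alpha/\sqrt{k}$ is obtained from the stated variance $1/(k\alpha^2)$ by the delta method.

```python
import math


def hill_estimator(sample, k):
    """Return H_k = (1/k) * sum_{i<=k} log X*_i - log X*_{k+1} for the ranked sample."""
    ranked = sorted(sample, reverse=True)  # X*_1 >= X*_2 >= ... >= X*_n
    if not 1 <= k < len(ranked):
        raise ValueError("k must satisfy 1 <= k < n")
    if ranked[k] <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    mean_log_top = sum(math.log(x) for x in ranked[:k]) / k
    return mean_log_top - math.log(ranked[k])  # ranked[k] is X*_{k+1}


def tail_index_with_se(sample, k):
    """Return (alpha_hat, standard error), with alpha_hat = 1 / H_k."""
    h_k = hill_estimator(sample, k)
    if h_k <= 0:
        raise ValueError("H_k is zero (ties among the top k+1 values); alpha is not estimable")
    alpha_hat = 1.0 / h_k
    return alpha_hat, alpha_hat / math.sqrt(k)  # delta-method standard error
```

An approximate 95% confidence interval is then `alpha_hat ± 1.96 * se`. In practice the result should be inspected over a range of `k`, since the estimate depends on that choice.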

69 QQ-plots Remark Unfortunately, the Hill estimator is not robust, since it depends on the particular choice of $k$ and is sensitive to large values of $\log(X^*_k / X^*_{k+1})$. Recently, methods have been developed to robustify the estimator and to reduce its bias. Another method for tail analysis is the quantile-quantile plot (QQ-plot), where observations are plotted against the quantiles of a given distribution. If the observations follow this distribution, the graph is a straight line. According to Beirlant et al., Paretian distributions result in a linear $(-\log(i/(n+1)), \log X^*_i)$ plot, where the slope is approximately $1/\alpha$.
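The Pareto QQ-plot coordinates and a least squares fit through the upper $k$ points can be sketched as follows. With $-\log(i/(n+1))$ on the horizontal axis and $\log X^*_i$ on the vertical axis, an exact Pareto tail $G(x) = x^{-\alpha}$ gives a line with slope $1/\alpha$. The function names are hypothetical, and the sketch assumes positive observations.

```python
import math


def pareto_qq_points(sample, k):
    """Return the upper k points (-log(i/(n+1)), log X*_i) of the Pareto QQ-plot."""
    ranked = sorted(sample, reverse=True)  # X*_1 is the largest observation
    n = len(ranked)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    return [(-math.log(i / (n + 1)), math.log(ranked[i - 1])) for i in range(1, k + 1)]


def least_squares_slope(points):
    """Slope of the ordinary least squares line through the given (x, y) points."""
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx
```

The reciprocal of the fitted slope can serve as a graphical estimate of $\alpha$, and visible curvature in the upper points indicates that the Pareto-type assumption is doubtful for the chosen `k`.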

71 QQ-plots Pareto QQ-plots for (a) Garfield and (b) another Price Medallist with the same h-index (with k = and fitted least squares line). B., Journal of Informetrics; Data source: Thomson Reuters Web of Knowledge

72 Conclusions Rank-based tests are sensitive to ties, which, in turn, often occur if observations are integer-valued. These tests should therefore be applied with the utmost care. Mean values and relative frequencies are unbiased and consistent estimators for the expectation and the corresponding probabilities. Their use in bibliometrics is therefore correct and not just a workaround. However, the underlying publication set needs to be large enough. papers is a commonly accepted lower limit. The deviation of indicators like mean citation rates and the share of [un]cited papers from corresponding values of other samples or from given reference values or expectations (or probabilities) can be tested for significance, provided the size of the underlying paper set is large enough. Seemingly large deviations between indicators may prove to be not significant. The tail parameter provides information about the high end of activity and (citation) impact.


Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

Chapter 2. Random Variable. Define single random variables in terms of their PDF and CDF, and calculate moments such as the mean and variance.

Chapter 2. Random Variable. Define single random variables in terms of their PDF and CDF, and calculate moments such as the mean and variance. Chapter 2 Random Variable CLO2 Define single random variables in terms of their PDF and CDF, and calculate moments such as the mean and variance. 1 1. Introduction In Chapter 1, we introduced the concept

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

Estimation of risk measures for extreme pluviometrical measurements

Estimation of risk measures for extreme pluviometrical measurements Estimation of risk measures for extreme pluviometrical measurements by Jonathan EL METHNI in collaboration with Laurent GARDES & Stéphane GIRARD 26th Annual Conference of The International Environmetrics

More information

LIMITS FOR QUEUES AS THE WAITING ROOM GROWS. Bell Communications Research AT&T Bell Laboratories Red Bank, NJ Murray Hill, NJ 07974

LIMITS FOR QUEUES AS THE WAITING ROOM GROWS. Bell Communications Research AT&T Bell Laboratories Red Bank, NJ Murray Hill, NJ 07974 LIMITS FOR QUEUES AS THE WAITING ROOM GROWS by Daniel P. Heyman Ward Whitt Bell Communications Research AT&T Bell Laboratories Red Bank, NJ 07701 Murray Hill, NJ 07974 May 11, 1988 ABSTRACT We study the

More information

The Nonparametric Bootstrap

The Nonparametric Bootstrap The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

More information

Week 2. Review of Probability, Random Variables and Univariate Distributions

Week 2. Review of Probability, Random Variables and Univariate Distributions Week 2 Review of Probability, Random Variables and Univariate Distributions Probability Probability Probability Motivation What use is Probability Theory? Probability models Basis for statistical inference

More information

Lecture notes for /12.586, Modeling Environmental Complexity. D. H. Rothman, MIT October 20, 2014

Lecture notes for /12.586, Modeling Environmental Complexity. D. H. Rothman, MIT October 20, 2014 Lecture notes for 12.086/12.586, Modeling Environmental Complexity D. H. Rothman, MIT October 20, 2014 Contents 1 Random and scale-free networks 1 1.1 Food webs............................. 1 1.2 Random

More information

Part I Stochastic variables and Markov chains

Part I Stochastic variables and Markov chains Part I Stochastic variables and Markov chains Random variables describe the behaviour of a phenomenon independent of any specific sample space Distribution function (cdf, cumulative distribution function)

More information

375 PU M Sc Statistics

375 PU M Sc Statistics 375 PU M Sc Statistics 1 of 100 193 PU_2016_375_E For the following 2x2 contingency table for two attributes the value of chi-square is:- 20/36 10/38 100/21 10/18 2 of 100 120 PU_2016_375_E If the values

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Model Fitting. Jean Yves Le Boudec

Model Fitting. Jean Yves Le Boudec Model Fitting Jean Yves Le Boudec 0 Contents 1. What is model fitting? 2. Linear Regression 3. Linear regression with norm minimization 4. Choosing a distribution 5. Heavy Tail 1 Virus Infection Data We

More information

Module 9: Stationary Processes

Module 9: Stationary Processes Module 9: Stationary Processes Lecture 1 Stationary Processes 1 Introduction A stationary process is a stochastic process whose joint probability distribution does not change when shifted in time or space.

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

1: PROBABILITY REVIEW

1: PROBABILITY REVIEW 1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Quick Review on Linear Multiple Regression

Quick Review on Linear Multiple Regression Quick Review on Linear Multiple Regression Mei-Yuan Chen Department of Finance National Chung Hsing University March 6, 2007 Introduction for Conditional Mean Modeling Suppose random variables Y, X 1,

More information

Midterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm

Midterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm Midterm Examination STA 215: Statistical Inference Due Wednesday, 2006 Mar 8, 1:15 pm This is an open-book take-home examination. You may work on it during any consecutive 24-hour period you like; please

More information

V. Properties of estimators {Parts C, D & E in this file}

V. Properties of estimators {Parts C, D & E in this file} A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared

More information

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321 Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process

More information