Robust Lifetime Measurement in Large-Scale P2P Systems with Non-Stationary Arrivals

Robust Lifetime Measurement in Large-Scale P2P Systems with Non-Stationary Arrivals

Xiaoming Wang, Zhongmei Yao, Yueping Zhang, and Dmitri Loguinov

(An earlier version of this paper appeared in IEEE P2P 2009. Xiaoming Wang, Zhongmei Yao, and Dmitri Loguinov are with Texas A&M University, College Station, TX 77843 ({xmwang, mayyao, dmitri}@cs.tamu.edu). Yueping Zhang is with NEC Laboratories America, Inc., Princeton, NJ (yueping@nec-labs.com).)

Abstract: Characterizing user churn has become an important topic in studying P2P networks, in both theoretical analysis and system design. Recent work [24] has shown that direct sampling of user lifetimes may lead to certain bias (arising from missed peers and round-off inconsistencies) and proposed a technique that estimates lifetimes based on sampled residuals. In this paper, however, we show that under non-stationary arrivals, which are often present in real systems, residual-based sampling does not correctly reconstruct user lifetimes and suffers a varying degree of bias, which in some cases makes estimation completely impossible. We overcome this problem with two contributions: a novel non-stationary ON/OFF churn model and an unbiased randomized residual sampling technique for measuring user lifetimes. The former allows correlation between ON/OFF periods of the same user and exhibits different join rates during the day. The latter spreads sampling points uniformly during the day and uses a novel estimator to reconstruct the underlying lifetime distribution. We finish the paper with experimental measurements of Gnutella and a discussion of the reduction in overhead compared to direct sampling methods.

I. INTRODUCTION

The problem of measuring temporal and topological characteristics of large-scale peer-to-peer networks such as Gnutella [6] and KaZaA [9] has recently received considerable attention [1], [2], [3], [13], [18], [2], [24]. One of the central elements in capturing the dynamics of P2P systems is the lifetime distribution of participants, which can provide valuable input to throughput models [5], [16], resilience analysis [11], [26], [27], and system design [7], [12], [18]. Previous efforts in sampling lifetimes can be categorized into two classes: direct sampling [2], [18], [2], which performs periodic crawls of the system to detect new peer arrivals and measure their lifetimes, and indirect sampling [24], which scans the entire system only once and monitors all discovered peers until they depart to obtain their residual session lengths. While the latter estimator is unbiased after proper conversion of residuals to lifetimes and requires several orders of magnitude less bandwidth than the former [24], it relies on one crucial assumption: stationarity of the arrival process. It thus remains to be seen whether the same benefits can be achieved in systems that exhibit diurnal arrival/departure patterns or any other non-stationary dynamics. However, in order to study this question rigorously, one requires a non-stationary model of user behavior and the corresponding analysis of lifetime sampling. We focus on these issues next.

A. Non-Stationary User Churn

Recall that traditional analytical P2P work either directly assumes stationary Poisson arrivals [1], [14], [15], [21] or models users with equilibrium ON/OFF renewal processes [11], [24], [27], whose scaled superposition tends to a stationary Poisson process for sufficiently large system size [26].
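To make the equilibrium ON/OFF picture concrete, the short simulation sketch below (Python) superposes many independent ON/OFF users and checks that the aggregate arrival counts per unit time are roughly Poisson (mean close to variance). Exponential ON/OFF durations, the warm-up period, and all parameter values are illustrative assumptions made here, not choices from [26].

import random

def simulate_sr_cm(n_users=2000, horizon=300.0, warmup=100.0,
                   mean_on=1.0, mean_off=3.0, seed=1):
    """Superpose n independent ON/OFF renewal users (SR-CM style) and
    return the OFF-to-ON (arrival) times after a warm-up period."""
    rng = random.Random(seed)
    arrivals = []
    for _ in range(n_users):
        t, on = 0.0, False          # every user starts OFF; warm-up removes the transient
        while t < horizon:
            if on:
                t += rng.expovariate(1.0 / mean_on)    # ON duration (lifetime)
            else:
                t += rng.expovariate(1.0 / mean_off)   # OFF duration ends with an arrival
                if warmup <= t < horizon:
                    arrivals.append(t)
            on = not on
    return arrivals

if __name__ == "__main__":
    horizon, warmup = 300.0, 100.0
    arr = simulate_sr_cm(horizon=horizon, warmup=warmup)
    span = int(horizon - warmup)
    counts = [0] * span
    for t in arr:
        counts[int(t - warmup)] += 1                   # arrivals per unit-time bin
    mean = sum(counts) / span
    var = sum((c - mean) ** 2 for c in counts) / span
    print("aggregate arrivals per unit time: mean %.1f, variance %.1f" % (mean, var))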
In our comparison with related work, we only consider the approach of [26], which we call Stationary Renewal Churn Model (SR-CM), since it includes all other models as special cases. We start the paper by designing a novel generic arrival model for Internet users that can replicate first-order dynamics (i.e., mean arrival rate) of almost any non-stationary churn process. In the proposed approach, which we call Non- Stationary Periodic Churn Model (NS-PCM), each user alternates between ON (alive) and OFF (dead) states. As before, the duration L of ON cycles is drawn from the distribution of user lifetime F L (x), but OFF states are now split into two sub-states: REST and WAIT. The former sub-state can be visualized as the delay between the user s departure and midnight of the day when he/she joins the system again. The latter sub-state is the delay from midnight until the user s arrival into the system within a given day, which follows its own distribution F A (x). Unlike prior models, NS-PCM allows OFF periods to be dependent on the time of day and the duration of the previous ON cycle (i.e., user lifetime). We derive that the average arrival rate λ(t) during the day is given by nτf A (t)/, where n is the system size, τ is the period of the arrival process, is the average inter-arrival delay of a user, and f A (t) = F A (t) is the PDF of WAIT time. Thus, NS-PCM can achieve any continuous non-stationary periodic arrival rate by adjusting density f A (x) and includes SR-CM as a special case with f A (x) = 1/τ. We show examples of using NS-PCM to model Gnutella and then analyze its impact on the existing sampling methods in distributed P2P systems. B. Analysis of Existing Methods Equipped with the new model, we examine two major existing paradigms for measuring the lifetime distribution: Create-Based Method (CBM) [17] and ResIDual-based Estimator (RIDE) [24]. The former takes snapshots of the system every time units within some window W and builds a distribution of observed lifetimes as an estimate of F L (x). The latter takes only one full snapshot of the system and

probes discovered users every Δ time units until they die or the observation window ends. The measured residuals are used to infer the target distribution F_L(x) using equilibrium renewal-theory assumptions. While [24] analyzes both approaches for accuracy, it does so assuming stationary arrivals into the system under SR-CM. We perform the same task using the new model NS-PCM and obtain several interesting results. First, we show that the bias in CBM is now affected not only by Δ and the lifetime distribution F_L(x), but also by the arrival CDF F_A(x). This makes removal of the bias much harder, as it requires knowing the arrival pattern of users. Second, we derive the exact distribution produced by CBM and establish that it is unbiased only when Δ → 0 or F_L(x) is exponential. Third, we find that RIDE's estimator under non-stationary churn does not converge and sometimes produces completely invalid results (including CDF functions that are non-monotonic).

To understand the cause of sampling bias in RIDE, we investigate the distribution of residual lifetimes in systems driven by NS-PCM. Define R(t) to be the remaining session duration of a random online user at time t and H(x, t) = P(R(t) ≤ x) to be the CDF of residuals of currently alive users. Our analysis shows that, unlike in prior models where lim_{t→∞} H(x, t) = H(x) existed, NS-PCM does not admit a limiting distribution of R(t), which explains why RIDE's manipulation of non-existing metrics produces unpredictable results. Finally, we show that RIDE's bias under NS-PCM cannot be eliminated even with Δ → 0 and that accurate estimation is possible only when λ(t) = λ is a constant (i.e., stationary churn) or lifetimes F_L(x) are exponential, neither of which is a realistic assumption in practice [8], [18], [19], [22]. It therefore remains an open problem to design a low-overhead and robust lifetime estimator for distributed systems commonly found in real life. We perform this task next.

C. U-RIDE

To preserve the advantage of residual sampling in terms of overhead, we design a novel sampling algorithm called Uniform ResIDual-based Estimator (U-RIDE), which measures the system at uniformly random points in the observation window. The naive approach would be to compute the expected residual distribution E[H(x, U)], where U is a uniformly random sampling time within the period τ of the arrival process; however, we show that this expectation does not allow reconstruction of user lifetimes and is generally not related to F_L(x) in closed form. Instead, we derive a different estimator using renewal-reward theory and show that it allows unbiased estimation of F_L(x) under the most general conditions of NS-PCM.

The first component of U-RIDE is a sample-scheduling algorithm, which decides the random time instants for residual sampling. We study one such algorithm that we call Bernoulli Scheduling (BS), which leverages the BASTA principle (Bernoulli Arrivals See Time Averages) [25] and allows accurate measurement even when the network is small or the period τ of λ(t) is unknown. The second component of U-RIDE is a residual processing algorithm, which aggregates residual samples obtained by the first component and outputs a statistical quantity that can be used to estimate F_L(x). We show that our aggregation algorithm can be efficiently implemented in large systems and that it admits a subsampling technique similar to the one in [24].
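As a concrete illustration of the scheduling idea (analyzed in detail in Section IV), a minimal sketch of Bernoulli-style schedule generation follows; the function name, parameters, and default values are ours and purely illustrative.

import random

def bernoulli_schedule(t1, delta, p, num_snapshots, seed=None):
    """Bernoulli Scheduling (BS) sketch: each inter-snapshot gap is
    delta * Geometric(p) plus a Uniform[0, delta) jitter, so snapshot
    offsets within an unknown period become uniform as the schedule
    grows (the BASTA idea).  Not the authors' implementation."""
    rng = random.Random(seed)
    times = [float(t1)]
    while len(times) < num_snapshots:
        v = 1
        while rng.random() > p:          # number of delta-slots until "success"
            v += 1
        times.append(times[-1] + delta * v + rng.uniform(0.0, delta))
    return times

# example: 24 snapshots, 30-minute base interval, p = 0.5
schedule = bernoulli_schedule(t1=0.0, delta=0.5, p=0.5, num_snapshots=24, seed=7)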
Simulation results demonstrate that U-RIDE is able to accurately estimate the actual lifetime distribution F_L(x) in a variety of non-stationary systems driven by NS-PCM.

D. Experiments

We finish the paper by deploying U-RIDE in the Gnutella network [6], a large P2P file-sharing system of roughly 6 million concurrent users. We evaluate U-RIDE using over 26M peer lifetime samples and show that RIDE [24] indeed exhibits non-trivial error compared to CBM, whose bias we neglect given the small Δ used in the experiments. On the other hand, the proposed algorithm U-RIDE produces very accurate estimates and tracks CBM distributions precisely, while at the same time reducing overhead by two orders of magnitude. Since U-RIDE is a generic sampling method that does not assume anything specific to Gnutella, it is suitable for many large, non-stationary distributed systems found in today's Internet.

The remainder of the paper is organized as follows. We introduce a new user churn model in Section II and examine the existing sampling algorithms in Section III. We propose our new method in Section IV and evaluate it in Gnutella experiments in Section V. Section VI reviews prior work and Section VII concludes the paper.

II. NON-STATIONARY USER CHURN

In this section, we cover basic definitions, briefly discuss prior arrival models, present our simple approach for generating non-stationary churn, and examine its ability to replicate arrival rates in Gnutella [6].

A. Basics

Two important metrics of interest in any churn model are the arrival process and its rate. Let M_i(t) be the number of arrivals from user i into the system in [0, t] and assume λ_i(t) is the corresponding arrival rate (whose existence we prove below under certain assumptions):

  λ_i(t) = lim_{h→0} E[M_i(t + h) − M_i(t)] / h.   (1)

The aggregate arrival process of the system is then M(t) = Σ_{i=1}^n M_i(t) and its rate is λ(t) = Σ_{i=1}^n λ_i(t), where n is the total number of participating users. Our interest in stationarity of a process is solely related to its rate, as defined next.

Definition 1: Arrival process M(t) is called rate-stationary if λ(t) = λ is simply a constant and non-stationary otherwise.

To understand the properties of non-stationary processes, define

  τ = inf{τ > 0 : λ(t + τ) = λ(t), ∀t ≥ 0}

to be the period of arrival rate λ(t). Note that τ = 0 implies a stationary process and τ > 0 a non-stationary one. The latter type can be further classified as follows.

Fig. 1. User process Z_i(t) under SR-CM.
Fig. 2. User process Z_i(t) under NS-PCM, where dashed vertical lines represent bin boundaries.

Definition 2: Non-stationary process M(t) is called rate-periodic if 0 < τ < ∞ and rate-aperiodic if τ = ∞.

Note that most real-life churn falls under the category of rate-periodic. We are now ready to examine prior churn models and overcome their limitations.

B. Stationary Renewal Churn Model (SR-CM)

Recall that [26] models each user in P2P systems using an alternating ON/OFF renewal process:

  Z_i(t) = 1 if user i is alive at t (ON), and Z_i(t) = 0 otherwise (OFF),   (2)

which is illustrated in Fig. 1, where {L_{i,k}} are random ON durations, {D_{i,k}} are random OFF durations, and {T_{i,k}} are the arrival times of user i, for k = 1, 2, .... Note that the renewal nature of this process implies that all ON/OFF durations are independent of each other, which makes each Z_i(t) stationary as t → ∞. As a result, the superposition of n such arrival processes converges to a stationary point process with constant rate λ(t) = λ. Since this stationarity does not match churn characteristics observed in Gnutella and other P2P systems [8], [18], [19], [22], one requires a much more general approach, which we offer next.

C. Non-Stationary Periodic Churn Model (NS-PCM)

As before, assume that each user i is modeled by an alternating ON/OFF point process Z_i(t) in (2); however, it is no longer renewal, as we allow OFF cycles {D_{i,k}} to depend on both lifetimes {L_{i,k}} and the time when the current OFF cycle starts. Specifically, assume τ < ∞ is the period of the system that we aim to model (e.g., for common human activity, τ = 24 hours) and partition time into bins of τ units each. For any point t ∈ [0, ∞), define b(t) = τ⌊t/τ⌋ to be the start of the corresponding bin, e(t) = τ⌈t/τ⌉ to be its end, and t̃ = t − b(t) to be the offset of t within its bin. Further denote by S_{i,k} = b(T_{i,k}) the beginning of the bin where user i arrives for the k-th time and assume each arrival occurs only once per bin (a user arriving m times in a given bin can be represented by m different users with arrivals scattered throughout the day).

As shown in Fig. 2, the OFF period in the current bin [S_{i,k}, S_{i,k} + τ] starts with a WAIT duration A_{i,k}, which models the habits of users and their arrival preferences during the day. After process Z_i(t) transitions to the ON state, the user stays logged in for a random lifetime L_{i,k} and then departs from the system. Afterwards, the user stays in the REST state until time S_{i,k+1} (i.e., the beginning of the bin when i decides to return into the system), from which point the process repeats. Note that the combination of REST and WAIT sub-states comprises the OFF state of (2) and that each REST duration may include a random number of full bins Y_{i,k}, which represent long-term absence cycles of the user from the Internet. Furthermore, observe in the figure that OFF durations are clearly dependent not only on user lifetimes in the same cycle, but also on the time of departure. We next make two assumptions that allow this system to be tractable in closed form.

Assumption 1: Sequence {L_{i,k}} consists of i.i.d. variables with CDF F_L(x), {A_{i,k}} is i.i.d. with differentiable CDF F_A(x), and {Y_{i,k}} is i.i.d. with CDF F_Y(x). Furthermore, these sequences are pair-wise independent.
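The construction above is straightforward to simulate. The sketch below generates the (arrival, departure) pairs of a single NS-PCM user; the specific WAIT, lifetime, and absence distributions are placeholder assumptions chosen only to make the example concrete, not part of the model.

import math
import random

def ns_pcm_user(tau=24.0, num_cycles=1000, seed=None):
    """Generate (arrival, departure) pairs for one NS-PCM user.  Any F_A on
    [0, tau), F_L, and F_Y could be substituted for the placeholders below."""
    rng = random.Random(seed)
    sessions = []
    s = 0.0                                    # S_{i,k}: bin boundary preceding the k-th arrival
    for _ in range(num_cycles):
        a = tau * rng.betavariate(2, 2)        # WAIT offset within the bin (assumed F_A)
        l = rng.expovariate(1.0 / 3.0)         # lifetime L_{i,k} (assumed F_L, mean 3 hours)
        sessions.append((s + a, s + a + l))    # ON period: [arrival, departure]
        e = tau * math.ceil((a + l) / tau)     # e(A + L): end of the bin containing the departure
        y = tau * rng.randrange(0, 4)          # duration of full absent bins (assumed F_Y)
        s = s + e + y                          # next cycle starts at a later bin boundary
    return sessions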
Given this assumption, we can replace each user i s lifetimes with a random variable L F L (x) such that < E[L] <, its WAIT durations with A F A (x) where F A (τ) = 1, and its absence times with Y F Y (x). Pair-wise independence means that the lattice process defined on points {S i,k } k=1 for each user i is renewal (formally established below), even though Z i (t) is not. Additionally, notice that inter-arrival delays {T i,k+1 T i,k } k=1 are i.i.d. and do not depend on user i. Our second assumption prevents synchronization between different users and ensures sufficient variety of samples collected from crawling the system. Assumption 2: Processes {Z i (t)} n i=1 are mutually independent. We call the system defined by the above rules and assumptions Non-Stationary Periodic Churn Model (NS-PCM). The following lemma reveals an important property of the point process formed by {S i,k }, i.e., bin boundaries before each arrival (see Fig. 2). Lemma 1: Point process {S i,k } k=1 is lattice and renewal. Proof: Notice that S i,k+1 S i,k can be expressed as: S i,k+1 S i,k = e(a i,k + L i,k ) + Y i,k, where e(t) = τ t/τ is the end of the bin that contains t. From Assumption 1, it follows that interval {S i,k+1 S i,k } k=1 is i.i.d. and lattice, which establishes the desired result. Next, we use Lemma 1 to show that arrival rate λ(t) under NS-PCM is a simple periodic function determined by F A (x). Lemma 2: Suppose that time t is sufficiently large. Then, arrival rate λ(t) of an NS-PCM system with n users exists and is a periodic function given by: λ(t) = nτf A(t ), (3) where t [, τ) is the offset of t within the bin, = E[T i,k+1 T i,k ] is the average inter-arrival delay of a user, and f A (x) = F A (x) is the PDF of arrival time A.

4 Proof: First of all, denote by I i (t, t+h) a random variable indicating whether user i has an arrival within interval [t, t+h), where h > is small enough so that t and t + h are in the same bin. Then, we rewrite arrival rate λ(t) defined in (1) as follows: E[M(t + h) M(t)] λ(t) = h h n E[ i=1 = I i(t, t + h)] h h n i=1 = P (I i(t, t + h) = 1). (4) h h Next, we derive P (I i (t, t+h) = 1), which is the probability for a user i to have an arrival in interval [t, t + h). Notice that I i (t, t+h) is equivalent to the event that there exists an integer k such that arrival time T i,k [t, t + h): P (I i (t, t + h) = 1) = P (T i,k [t, t + h)). (5) Further notice that T i,k = S i,k + A i,k by definition, where S i,k and A i,k are the start and offset of the bin containing T i,k. Thus, T i,k [t, t + h) is equivalent to the event that T i,k is contained by the same bin as t and the offset of T i,k is included by interval [t, t + h]: T i,k [t, t + h) (S i,k = b(t)) (A i,k [t, t + h]). (6) Since S i,k and A i,k are independent by Assumption 1, it thus follows from (5)-(6) that: P (I i (t, t + h) = 1) = P (S i,k = b(t)) P (A i,k [t, t + h)). (7) For P (A i,k [t, t + h)), we simply have that: P (A i,k [t, t + h)) = F A (t + h) F A (t ). (8) It remains to derive the probability P (S i,k = b(t)). Notice that S i,k = b(t) for any k is equivalent to the event that t hits the first bin of the k-th cycle. Note that point process {S i,k } is proved in Lemma 1 to be a lattice renewal process. Using renewal theory, we obtain that time t hits the first bin with probability 1/η, where η is the expected number of bins in cycle [S i,k, S i,k+1 ). Further note that η = /τ, where is the expected length of cycle [S i,k, S i,k+1 ): = E[S i,k+1 S i,k ] = E[T i,k+1 T i,k ], (9) where the second equality comes from the fact that E[A i,k ] = E[A i,k+1 ] by Assumption 1. It thus follows that: P (P (S i,k = b(t))) = 1/η = τ/. (1) Substituting (8) and (1) into (7) establishes that: P (I i (t, t + h) = 1) = τ(f A(t + h) F A (t )). (11) Further utilizing (11) and considering that n users are homogeneous, we reduce (4) into: λ(t) = h nτ(f A (t + h) F A (t )) h = nτf A(t ), (12) arrival rate (users/sec) 1 9 8 7 6 5 4 3 6/15 6/16 6/17 6/18 6/19 6/2 time (a) Gnutella arrival rate (users/sec) 1 9 8 7 6 5 4 3 1 2 3 4 5 6 7 time (days) (b) NS-PCM simulations Fig. 3. User arrival rate: a) observed in Gnutella during June 14-2, 27; b) obtained from NS-PCM simulations. where the second equality is guaranteed by the assumption that F A (x) is differentiable every in [, τ). The desired result follows immediately. Using (3), one can approximate first-order dynamics of a wide class of systems with both stationary and non-stationary arrivals. For example, setting f A (x) = 1/τ, we obtain λ(t) = n/ = λ, which is identical to SR-CM. To illustrate a more interesting example, we first collect arrival rates from a 7- day measurements of the Gnutella network (see Section V for details) and plot them in Fig. 3(a), which indicates a clear pattern of diurnal churn. Then, we average the empirical arrival rate λ(t) over the observed 7 days to obtain the parameters of NS-PCM. Specifically, integrating (3), we get for x [, τ): f A (x) = λ(x) τ λ(t)dt, = nτ (13) λ(t)dt. Finally, we generate a system of n = 1, users with A drawn from f A (x) and plot the resulting instantaneous arrival rates in Fig. 3(b), which shows a random arrival pattern very similar to that of Gnutella. τ III. 
ANALYSIS OF EXISTING METHODS

In this section, we characterize the accuracy of existing measurement methods under NS-PCM. Discussion of the associated overhead is presented in Section IV.

A. Basics

Suppose that the target P2P system is fully decentralized and the sampling process has recurring access to information about which users are currently present in the system. This process allows us to test whether a given user i is still alive, as well as to discover the entire population of the network at any time t (e.g., using crawls). The goal of the sampling process is to estimate with as much accuracy as possible the function F_L(x), which we assume is continuous everywhere in the interval (0, ∞). However, due to bandwidth and connection-delay constraints on obtaining this information, the sampling process cannot query the system for longer than W time units or more frequently than once per interval Δ, where Δ usually varies from several minutes to several hours depending on the speed of the measurement facility and the network size. These constraints lead to the following two properties: 1) all lifetime samples are discrete and rounded to a multiple of Δ; and 2) all samples are no larger than W.
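A small sketch of how these two constraints can be applied to raw session data; the round-up convention is one possible choice (the direction of round-offs is exactly the subtlety analyzed below).

import math

def measured_samples(true_lifetimes, delta, window):
    """Apply the two measurement constraints above (a sketch): round each
    lifetime up to a multiple of delta (one possible convention) and keep
    only samples that fit within the window W."""
    rounded = [delta * math.ceil(l / delta) for l in true_lifetimes]
    return [r for r in rounded if r <= window]

def empirical_estimator(samples, delta, window):
    """Empirical CDF of the measured samples on the grid x_j = j*delta,
    i.e., the discrete estimate that a sampling algorithm reports."""
    n = float(len(samples))
    return {j * delta: sum(1 for s in samples if s <= j * delta) / n
            for j in range(1, int(window / delta) + 1)}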

5 Denote by A the sampling algorithm of interests and by V A its sample set after an infinite measurement. Further define E A (x) to be the estimator function computed from set V A for approximating the value of F L (x) = P (L x). Note that E A (x) is discrete and can be arbitrarily different from target distribution F L (x). Definition 3 ([24]): Estimator E A (x) of algorithm A is unbiased with respect to a target continuous random variable L if it matches the distribution of L at all discrete points x j = j, j = 1, 2,..., W/ in the interval [, W ] for any > : E A (x j ) = P (L x j ). (14) Notice that empirical distributions based on a finite set V A will not generally match the target distribution F L (x), which is not a source of bias but rather a itation of the finite measurement process. Definition 3 instead refers to errors that cannot be einated by sampling the system indefinitely. B. Create-Based Method (CBM) CBM was first introduced by [17] in the context of operating systems and later applied to peer-to-peer networks by [2], [18], [2]. Recall from [17] that CBM uses an observation window of size 2W, which is split into small intervals of size. Within the observation window [, 2W ], the algorithm takes snapshots of the system at points x j = j, i.e., at the beginning of each interval. To avoid sampling bias, [17] suggests dividing the window into two halves and only including in sample set V C lifetimes that appear during the first half of the window. Based on V C, define E C (x j ) to be the CBM estimator of the lifetime distribution F L (x): E C (x j ) = N C (x j ), (15) N C N C where N C = V C is the size of the sample set and N C (x) is the number of seen users with lifetimes no larger than x. As formalized by [24], there are two possible causes of bias in CBM sampling: 1) missed peers that join and depart between consequent crawls; 2) random direction of round-offs (i.e., some samples rounded up and others down). We say a user s lifetime L such that x j L < x j+1 is inconsistently sampled if it is rounded down to x j and consistently sampled otherwise (i.e., rounded up to x j+1 ). Define ρ j to be the probability of inconsistent round-offs for lifetimes in the interval [x j, x j+1 ), where ρ refers to the probability of missing a user. The next theorem indicates that the bias in CBM under NS- PCM is determined not only by and lifetime distribution F L (x), but also by the arrival distribution F A (x). Theorem 1: Under NS-PCM, CBM estimator (15) produces the following distribution: where ρ j is given by: ρ j = v= E C (x j ) = F L(x j ) ρ + ρ j 1 ρ, (16) xv+1 x v (F L (x v+j+1 y) F L (x j ))f A (y )dy W, f A (u )du (17) X t =... x v x v+1... x v+j x v+j+1 Fig. 4. Illustration of inconsistent round-off in CBM, where arrival time X [x v, x v+1 ] and lifetime L (x j, x j+1 ]. Vertical dotted lines stand for CBM sampling points with interval. Gray area represents the region of inconsistent sampling, that is, if the user departs within the gray area, its lifetime could be inconsistently rounded to x j. The gap between x v+j and the beginning of the gray area is given by d. F L (x) is the CDF of the lifetime distribution, and f A (x) is the PDF of arrivals. Proof: Note that (16) can be proved using exactly the same reasoning as in [24, Theorem 3]. In what follows, we thus focus on deriving the result of ρ j in (17), the probability of inconsistent round-offs for lifetimes in the interval [x j, x j+1 ). 
Without loss of generality, we shift the time origin to the start time t, i.e., t = so that the first half of the window is given by [, W ]. Suppose that a user arrives at time X [, W ], where X is relative to t, and its lifetime is x j < L x j+1, where x j = j. Further assume that arrives time X [x v, x v+1 ]. Denote by d = X x v the gap between X and x v. Then, we establish that the user lifetime is inconsistently rounded off to x j if and only if departure time X +L is within [x v+j + d, x v+j+1 ), which is illustrated in Fig. 4 as gray area. Denote by g v (X) the probability of inconsistent sampling given that X [x v, x v+1 ]. Then, we express g v (X) as: g v (X) = P (x v+j + d < X + L x v+j+1 X [x v, x v+1 ]) = P (x j < L x v+j+1 X X [x v, x v+1 ]) L = F L (x v+j+1 X) F L (x j ). (18) Next, we relax the condition on X to obtain roundoff-error probability ρ j. Denote by f X (x) the PDF of arrival time X. Thus, we have: ρ j = v= xv+1 d x v g v (y)f X (y)dy. (19) It remains to derive the distribution f X (x) of arrival time X. Notice that X could be anywhere within [, W ]. We first consider the probability of a user arriving before time x in the first half of the window, i.e., F X (x) = P (X x). In fact, P (X x) is given by the ratio of the number of users arriving before x over the total number of users that appear in the first half of the window. Utilizing the result of λ(t) in (3), we obtain that: x F X (x) = λ(t)dt x W λ(u)du = f A(t )dt W f A (u )du. (2) The last step is to differentiate F X (x), which gives: f X (x) = f A (x ) W f A (u )du. (21) The desired result immediately follows from substituting (18) and (21) into (19).

6 Note that Theorem 1 generalizes the result developed in [24] to non-stationary systems. It is easy to verify that for stationary arrivals, i.e., f A (x) = 1/τ for x [, τ), the result in (17) becomes: ρ j = 1 xj+1 x j F L (x)dx F L (x j ), (22) which together with (16) gives the same expression for the CBM estimator as in [24]. We next investigate whether there exist cases that make CBM unbiased under the new churn model. Corollary 1: Under NS-PCM, the only lifetime distribution that allows CBM to be unbiased simultaneously for all > is exponential. Furthermore, as, probability ρ j and E C (x j ) F L (x j ), i.e., CBM becomes unbiased for any F L (x) and F A (x). Proof: We can prove the first statement by a) deriving the necessary and sufficient condition for CBM to be unbiased simultaneously for all > and b) showing that the exponential distribution is the only one that satisfies the condition. For a), we substitute E C (x j ) = F L (x j ) into (16) to establish that the necessary and sufficient condition is given by: ρ j = F L (x j )ρ, (23) where F L (x j ) = 1 F L (x j ) is the complementary CDF of lifetimes. For b), it is easy to verify that any exponential distribution satisfies (23). With F L (x) = 1 e x/µ, inconsistency probability ρ j becomes that: ρ j = v= = e x j/µ xv+1 x v v= ( e x j/µ e x v+j+1+y ) f A (y )dy xv+1 x v ( 1 e x v+1+y ) f A (y )dy = F L (x j )ρ. (24) Now, we prove that the only lifetime distribution satisfying (23) is exponential. Substituting (17) into (23) establishes: v= xv+1 x v = F L (x j ) = v= xv+1 v= (F L (x j + x v+1 y) F L (x j )) f A (y )dy xv+1 x v F L (x v+1 y)f A (y )dy x v FL (x j )F L (x v+1 y)f A (y )dy. (25) For (25) to hold for all >, we need to have: F L (u + v) F L (u) = F L (u)f L (v), (26) for any u > and v >. Note that (26) simplifies to F L (u + v) = F L (u) F L (v), to which the only solution is F L (x) = e x/µ. The first statement of this corollary thus follows immediately. The second statement of this corollary can be easily proved by showing that ρ j tends to zero as. From Taylor expansion, we have that: F L (x v+j+1 y) F L (x j ) = f L (x j )(x v+1 y) + Θ( 2 ) and f L (x j ) + Θ( 2 ), (27) F A (x v+1) F A (x v) = f A (x j ) + Θ( 2 ). (28) Thus, ρ j can be upper bounded as follows: ρ j v= (f L (x j ) + Θ( 2 ))(f A (x j ) + Θ( 2 )) W f A (u )du = W f L(x j )f A (x j ) + Θ( 2 ) W, (29) f A (u )du which indicates that ρ j as. Interestingly, CBM s conditions for removing bias did not change from those under stationary churn (and are still impossible to satisfy in practice), despite the fact that its bias in all other cases became a much more complex function of both F L (x) and F A (x). We next examine how RIDE is impacted by NS-PCM. C. ResIDual-based Estimator (RIDE) Wang et al. [24] proposed RIDE to address potential problems of overhead and bias in CBM. At time t, RIDE takes a snapshot of the whole system and records in set V R all users found to be alive. For all subsequent intervals j (j = 1, 2,..., W/ ) of time units, the algorithm keeps probing peers in set V R either until they die or W expires. After the observation window W is over, the algorithm collects the residual lifetimes of users in V R. Define E H (x j ) to be the empirical residual distribution based on sample set V R : E H (x j, t ) = N R (x j ), (3) N R N R where N R = V R is the number of acquired samples and N R (x) is the number of them no larger than x. 
Denote by E_R(x_j, t_0) the RIDE estimator of F_L(x) obtained using a single crawl at time t_0:

  E_R(x_j, t_0) = 1 − h(x_j, t_0) / h(0, t_0),   (31)

where h(x, t_0) is the numerical derivative of E_H(x_j, t_0). To quantify the accuracy of (31), we must first determine how its companion E_H(x_j, t_0) relates to F_L(x). Notice that E_H(x_j, t_0) measures the residual lifetime distribution of users alive at t_0. Specifically, denote by R(t) the actual remaining lifetime of a random user alive at time t and by H(x, t) = P(R(t) ≤ x) its CDF. Then, we immediately have the following result.

Lemma 3: Under NS-PCM, E_H(x_j, t_0) is an unbiased estimator of H(x, t_0), i.e., E_H(x_j, t_0) = H(x_j, t_0) for j = 1, ..., W/Δ.

Proof: Notice that if a user is found to be alive at time t_0, its residual lifetime is by definition the time from t_0 until it dies.

This observation is consistent with the sampling rule of RIDE as described above. Therefore, sample set V_R contains instances of residuals R(t_0), and estimator E_H(x_j, t_0) computed from set V_R gives the empirical distribution of R(t_0). It follows that as the sample size |V_R| → ∞, the empirical distribution E_H(x_j, t_0) converges to the actual distribution P(R(t_0) ≤ x_j).

Then, the problem of analyzing RIDE's accuracy reduces to deriving the residual distribution H(x, t_0), which can be obtained by applying the lattice version of the Renewal-Reward Theorem [25, page 6] to point process {S_{i,k}}.

Theorem 2: Under NS-PCM, residual lifetime distribution H(x, t_0) is a periodic function of time t_0 for sufficiently large t_0:

  H(x, t_0) = 1 − ∫_x^∞ ω(z − x, t̃_0) dF_L(z) / ∫_0^∞ ω(z, t̃_0) dF_L(z),   (32)

where t̃_0 is the offset of t_0 within its bin and ω(x, u) for u ∈ [0, τ) is given by:

  ω(x, u) = F_A(u) − F_A(max(u − x, 0)) + 1 − F_A(τ + min(u − x, 0)) + b(x)/τ.   (33)

Proof: See Appendix I.

Now, we are ready to derive what values RIDE's estimator E_R(x_j) produces. Differentiating (32) and substituting the result into (31), we immediately establish the next corollary.

Corollary 2: Under NS-PCM, RIDE estimator E_R(x, t_0) is a periodic function of time t_0 for sufficiently large t_0:

  E_R(x_j, t_0) = 1 − ∫_{x_j}^∞ ω(z − x_j, t̃_0) df_L(z) / ∫_0^∞ ω(z, t̃_0) df_L(z),   (34)

where ω(.) is given in (33) and f_L(x) = F_L'(x) is the PDF of user lifetimes.

Proof: Differentiating both sides of (32), we obtain:

  h(x, t_0) = H'(x, t_0) = −∫_x^∞ ω(z − x, t̃_0) df_L(z) / ∫_0^∞ ω(z, t̃_0) dF_L(z),

which, combined with (31), establishes the desired result.

Note from (34) that the RIDE estimator E_R(x, t_0) is a complex function of F_L(x), arrival pattern F_A(x), and initial sample time t_0. To make estimation possible out of this result, one requires either exponential lifetimes or stationary arrivals, as shown next.

Corollary 3: Under NS-PCM, RIDE is unbiased for all lifetime distributions iff the arrival pattern is uniform, i.e., f_A(x) = 1/τ for x ∈ [0, τ). Similarly, RIDE is unbiased for all arrival patterns iff F_L(x) is exponential.

Interestingly, sampling interval Δ has no impact on the bias in (34), which means that no matter how fast RIDE samples the system, the bias cannot be eliminated (unlike in CBM, where it is actually possible).

Fig. 5. CBM estimator (15) under NS-PCM: (a) Pareto (E[L] = 3 hrs); (b) periodic (E[L] = 38 hrs).
Fig. 6. RIDE estimator (31) under NS-PCM: (a) Pareto (E[L] = 3 hrs); (b) periodic (E[L] = 38 hrs).

D. Simulations

We now examine CBM and RIDE in simulations to show examples of their bias. In all simulations, we use τ = 24 hours and the arrival pattern F_A(x) observed in the Gnutella network. We consider two lifetime distributions: 1) Pareto with F_L(x) = 1 − (1 + x/β)^(−α), where shape α = 2 and scale β is chosen such that E[L] = 3 hours; 2) periodic L = J_1 + J_2, where J_1 is uniformly discrete among {0, τ, 2τ, 3τ} and J_2 ∈ [0, τ) is a truncated exponential random variable with mean 2 hours. The former case models users with heavy-tailed lifetimes, which is fairly standard in evaluating churn models [11], [26]. The latter case covers peers that leave their computers logged in for J_1 full days and then spend a random amount of time J_2 browsing the system on the last day before finally departing.
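For reference, the two lifetime models above can be sampled as follows (a sketch; handling the truncation of J_2 by rejection is our reading of the setup):

import random

def pareto_lifetime(rng, alpha=2.0, beta=3.0):
    """F_L(x) = 1 - (1 + x/beta)^(-alpha); with alpha = 2, beta = 3 the mean
    is beta / (alpha - 1) = 3 hours."""
    return beta * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0)

def periodic_lifetime(rng, tau=24.0, mean_j2=2.0):
    """L = J1 + J2: J1 uniform over {0, tau, 2*tau, 3*tau}; J2 exponential
    with mean 2 hours truncated to [0, tau)."""
    j1 = tau * rng.randrange(0, 4)
    while True:
        j2 = rng.expovariate(1.0 / mean_j2)
        if j2 < tau:
            return j1 + j2

rng = random.Random(11)
samples = [periodic_lifetime(rng) for _ in range(5)]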
Using sampling interval = 3 hours and n = 1 6 users, we apply CBM and RIDE to obtain the corresponding estimates of target distribution F L (x). We observe from simulations that both (16) and (34) are very accurate in predicting the errors of these methods. Due to ited space, we omit this discussion and instead focus on the actual bias. Fig. 5 shows that CBM s estimates clearly deviate from both target distributions. Even though smaller intervals (i.e., E[L]) can oftentimes reduce the bias in CBM to negligible levels, this improvement comes at the expense of a sharp increase in overhead. RIDE results for the Pareto case are shown in Fig. 6(a), whose deviation distance from F L (x) resembles that of CBM in Fig. 5(a). However, the periodic case in Fig. 6(b) produces completely different results. Not only is the shape of the estimated distribution completely different from that of F L (x), but the estimated values do not even represent a valid CDF function (i.e., E R (x j, t ) is non-monotonic in variable x j ). Increasing overhead (i.e., lowering ) in this case has no impact and RIDE remains biased regardless of manipulations to the sampling process. E. Discussion In summary, all existing methods suffer from bias under NS-PCM and, to be complete accurate, require either high overhead (i.e., ) or unrealistic assumptions (i.e., exponential lifetimes, stationary arrivals), which cannot be satisfied

8 in practice. In what follows, we seek a better solution by adapting residual sampling to remain robust under general non-stationary arrivals while preserving its advantage over CBM in terms of overhead. IV. U-RIDE This section generalizes RIDE by varying its sample point t uniformly within the period of the arrival process λ(t). The main issue is to decide the location of sampling points without knowing period τ and build a provably unbiased estimator from collected samples. In what follows, we first develop a general framework that can produce an unbiased estimator for F L (x) and then present an algorithm to implement it. Toward the end of this section, we validate the proposed algorithm in simulations and compare its traffic overhead to that of prior methods. A. General Framework Instead of just one snapshot at time t, assume that we can crawl the entire system at multiple time points t 1, t 2,..., t M, where M is the number of snapshots permitted by the overhead-accuracy tradeoff. For each snapshot m, we identify all live users and independently track their residuals using recurring probing every time units. We call set T M = {t 1, t 2,..., t M } a sampling schedule and set O M = {t 1, t 2,..., t M } an offset schedule. We further assume that all t m are within some snapshot window W S W, i.e., t m [t 1, t 1 + W S ] for all m. Definition 4: Schedule T M is called uniform if its offset schedule O M forms a realization of a uniform random variable in [, τ) as M. Given a uniform schedule T M, we present a sampling algorithm that can construct an unbiased estimator of target distribution F L (x). Algorithm 1: Assuming schedule T M is uniform, obtain a snapshot of the entire system at each time t m T M. For snapshot m, record the number of alive users N R (t m ) and the number of them N R (x, t m ) with residual lifetimes no larger than x. Then, output the following ratio for each x j : M r(m, x j ) = N R(x j, t m ) M N. (35) R(t m ) We make two comments on Algorithm 1. First, notice that at each time t m, the sampling process does not know the exact number of discovered users that have residual lifetime R(t m ) no greater than x. Therefore, values N R (x, t m ) remain unknown until the end of the measurement, at which time they are updated simultaneously for all m [1, M]. Second, it can be shown that if a user is alive during two snapshots at times t m and t j, it must be sampled at both instances as if these were two independent users. Doing otherwise leads to incorrect estimation and bias in the result. For brevity, we omit additional discussion of this issue and the corresponding simulations. The next theorem indicates that Algorithm 1 can be used to infer target distribution F L (x). Theorem 3: The output of Algorithm 1 under NS-PCM converges as following: E H(x j ) := r(m, x j) = 1 E[L] xj (1 F L (t))dt. (36) Proof: As before, we first reduce EH (x) to reward functions W i,k (θ) and R i,k (x, θ) and then derive their formulas. Notice from Algorithm 1 that N R (t m ) records the number of alive users at time t m and can be expressed using Z i (t m ): n N R (t m ) = Z i (t m ), (37) which leads to: M N R (t m ) = i=1 M i=1 n Z i (t m ). (38) Dividing both sides of (38) by product nm, it thus follows that: M N R (t m ) n M nm = Z i (t m ) nm = = n i=1 M i=1 n i=1 Z i (t m ) nm E[Z i (t 1 + Θ)]. (39) n The third equality of (39) comes from the fact that the offset schedule of {t m } M forms a realization of a uniform random variable in [, τ). 
From Assumptions 1-2, n users in the system are homogeneous, which leads to: M N R (t m ) nm = E[Z i(t 1 + Θ)]. (4) Now, we let t 1 and rewrite E[Z i (t 1 + Θ)] using conditional expectation: M N R (t m ) nm = E[E[Z i(t 1 + Θ)] Θ]. (41) t 1 Applying the Dominated Convergence Theorem to (41) establishes that: M N R (t m ) = E[ nm E[Z i(t 1 + Θ)] Θ] t 1 = E[ t 1 P (Z i(t 1 + Θ) = 1) Θ]. (42) We apply (65) to P (Z i (t 1 + Θ) = 1) and replace it with E[W i,k (Θ)]/E[η i,k ] in (42). Since E[η i,k ] does not depend on Θ, it thus follows that (42) can be reduced to: M N R (t m ) nm = E[E[W i,k(θ)] Θ]. (43) E[η i,k ] Notice from conditional expectation that for given Θ: E[W i,k (Θ)] = ω(z, Θ)dF L (z), (44)

9 where function ω(.) is given in (33). It follows that: M N R (t m ) nm = 1 E[η i,k ] E[ω(z, Θ)]dF L (z). (45) Note that Θ is uniformly distributed in [, τ), i.e., P (Θ θ) = x/τ for θ [, τ). It thus follows from (33) that: 1 1 2 1 4 U RIDE actual 1 6 1 1 1 1 2 1 3 lifetime+β (hours) (a) Pareto (E[L] = 3 hrs) 1.8.6.4.2 U RIDE actual 2 4 6 8 1 lifetime (hours) (b) periodic (E[L] = 38 hrs) E[ω(z, Θ)] = z, (46) Fig. 7. U-RIDE estimator (52) with BS under NS-PCM. which leads to: M Similarly, we have: M N R (x, t m ) nm = 1 E[η i,k ] N R (t m ) nm = E[L] E[η i,k ]. (47) E[ϕ(x, z, Θ)]dF L (z), (48) where function ϕ(.) is given in (75). It follows from (75) that for uniform Θ: which establishes: M E[ϕ(x, z, Θ)] = min(x, z), (49) N R (x, t m ) nm = 1 E[η i,k ] x Combining (47) and (5), we obtain: 1 r(m, x) = E[L] x (1 F L (z))dz, (5) (1 F L (z))dz, (51) which is the desired result. Taking the derivative of EH (x) in (36), we immediately obtain the desired result. Corollary 4: For all, the following is an unbiased estimator of F L (x): E R(x j ) = 1 h (x j ) h (), (52) where h (x) is the numerical derivative of EH (x). We call Algorithm 1 in combination with (52) Uniform ResIDual-based Estimator (U-RIDE) and examine how to implement it below. In the meantime, it is worth mentioning that performing RIDE sampling at uniformly randomized time points U [, τ) and then taking the expectation of the resulting CDF, i.e., E[H(x, U)], does not produce the same result as EH (x) in (36). According to our analysis, E[H(x, U)] is heavily dependent on the arrival pattern F A (x) and thus cannot be used to reconstruct F L (x). This observation distinguishes the new method from simply applying RIDE a number of times and averaging the result. B. Scheduling The last piece of our algorithm is to find a uniform schedule T M for Algorithm 1. We use a simple approach that we call Bernoulli Scheduling (BS). Suppose that the algorithm starts at time t 1 and the smallest sampling interval is as before. Then BS generates sequence T M using: t m+1 = t m + v m + u m, m 1, where v m is drawn from a geometric distribution with success probability p and u m is drawn from a uniform distribution within [, ). From the property of BASTA (Bernoulli Arrival See Time Average) [25], it is straightforward to show that the BS algorithm produces uniform schedules. Corollary 5: Sampling schedule T M generated by BS is uniform for any period τ. Notice that the expected duration of a BS schedule is given by M /p. Therefore, p can achieve both dense (i.e., large p) and sparse (i.e., small p) sampling. The former allows the sampling process to complete in a short time, while the latter spreads traffic overhead over time and thus avoids overloading network resources. In addition, while our analysis earlier in the section implicitly assumed that period τ was known, BS does not require this knowledge and thus can be used in a wide variety of periodic systems without any additional input. Next, we examine U-RIDE under NS-PCM using the same parameters as in Fig. 6. We set p =.5 and M = 24 in BS scheduling. Fig. 7 plots the lifetime distributions estimated from the output of Algorithm 1 along with the actual F L (x), indicating a very accurate match between the two. Other simulations with Weibull, discrete, uniform, and exponential lifetimes, as well as various arrival patterns F A (x), indicate that U-RIDE is extremely accurate. We omit them for brevity. C. 
Overhead

We next study the question of how U-RIDE in its current form compares to the other two methods in terms of overhead. To address this issue, we first derive a formula to show how U-RIDE compares to RIDE. We assume unit cost for contacting a Gnutella peer, which makes traffic overhead directly equal to the number of users contacted during the sampling process. Denote by c_j the number of contacts made at the j-th step of the sampling process for j = 1, 2, ..., W/Δ. Then, define B_A to be the sampling overhead of an algorithm A:

  B_A = Σ_{j=1}^{W/Δ} c_j.   (53)

1 We use B C to represent the overhead of CBM, B U that of U-RIDE, and B R that of RIDE. Define q xy = B x /B y to be one of the overhead ratios of interest, where x, y {C, U, R}. Theorem 4: Assume BS scheduling with p and M and that U-RIDE starts at midnight. Then, overhead ratio q UR is given by: q UR = 1 + τ E[L] M m=2 ym y m 1 f A (t )(1 F L (y m t))dt, (54) where y m = m (1 p)/p, τ is the period of the arrival process, E[L] is the expected user lifetime, f A (x) is the PDF of arrivals, and F L (x) is the CDF of lifetimes. Proof: Note that both RIDE and U-RIDE use residual sampling, which keeps probing each discovered alive user until it dies or window T expires. Denote by V x the probing set obtained by residual sampling algorithm x, where x {RIDE, U-RIDE}. Then, the overhead B x of residual sampling x is proportional to the product of the probing set size and the expected residual lifetime, B x = V x E[R]. Since expected residual lifetime E[R] is the same for RIDE and U- RIDE, we only need to compare the probing set size V x of these two methods. Notice that probing set size V R of RIDE equals to the size of the first snapshot made by U-RIDE at time t 1 : V R = ne[l], (55) where is the expected inter-arrival delay. Note that (55) is simply the average number of alive users in steady-state. Therefore, we only need to count the number of new residual samples discovered by U-RIDE at time points t 2,..., t M. Denote by N m the number of new residual samples found at time t m for m = 2,..., M. Further note that we do not make multiple probes of the same user when it has multiple residual samples according to our algorithm. It thus follows that N m is the number of users that arrive during interval [t m 1, t m ) and live through time t m : N m = tm = nτ t m 1 λ(t )(1 F L (t m t))dt tm t m 1 f A (t )(1 F L (t m t))dt. (56) Then, we obtain the probing set size V U of U-RIDE: V U = ne[l] = ne[l] + M m=2 + nτ N m M m=2 tm t m 1 f A (t )(1 F L (t m t))dt. (57) Notice from the definition of BS that t m can be modeled by Y, where Y is a negative binomial random variable NegBin(m, p). We thus approximate t m with its expectation y m E[t m ] = m (1 p)/p in (57). It follows from (55) α W q CU ɛ U =.1 ɛ U =.1 ɛ U =.1 1.1 48 hrs 5.7 5 213 72 hrs 6 54 274 96 hrs 6.2 57 322 2 48 hrs 19 92 151 72 hrs 25 13 222 96 hrs 31 166 292 TABLE I OVERHEAD RATIO q CU USING UNIFORM ARRIVALS, PARETO LIFETIMES WITH SHAPE α, E[L] = 1 HOUR, = 3 MINUTES, AND U-RIDE WITH M = 8 AND p = 1/6. and (57) that overhead ratio q UR is given by: q UR = B U = V U B R V R = 1 + τ E[L] M m=2 ym y m 1 f A (t )(1 F L (y m t))dt, (58) which is exactly (54). The result in (54) shows that q UR is a function of M,, and p. Under uniform arrivals, (54) becomes: q UR = 1 + (M 1)H( /p), (59) where H(x) = 1 x E[L] (1 F L(u))du is the CDF of residual lifetimes in stationary systems. Notice that overhead ratio q UR is an increasing function of M for constant > and p and tends to M as p or. This observation motivates us to seek a more efficient way to execute U-RIDE. D. Subsampling Next, we propose a subsampling technique aimed at reducing the overhead of U-RIDE. In Algorithm 1, we apply ɛ-subsampling as follows: for each discovered user, toss an unfair coin with success probability ɛ to decide whether the sample should be kept (i.e., added to both N R (t m ) and N R (x, t m )) or discarded. This approach reduces measurement traffic by approximately a factor of 1/ɛ. 
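A minimal sketch of the ε-subsampling coin flip, together with one way to scale ε with system size (discussed next); the function names and the threshold parameter are illustrative, not the authors' code.

import random

def subsample_snapshot(discovered_users, epsilon, rng=random):
    """Keep each newly discovered user independently with probability epsilon;
    only the kept users are probed afterwards and counted in N_R(t_m) and
    N_R(x, t_m), cutting probing traffic by roughly a factor of 1/epsilon."""
    return [u for u in discovered_users if rng.random() < epsilon]

def choose_epsilon(population_size, target_samples):
    """Scale epsilon roughly as 1/n so the expected number of retained samples
    per snapshot stays near a fixed threshold as the system grows."""
    return min(1.0, target_samples / float(population_size))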
Using simple renewal-process arguments, it can be shown that subsampling does not affect the properties of users collected by U-RIDE and has no effect on its ability to avoid bias.

In order to select ɛ, notice that U-RIDE (as described above) obtains many more residual samples than RIDE, most of which are not necessary for accurate estimation. As long as the total number of samples Σ_{m=1}^M N_R(t_m) is above some threshold, U-RIDE will converge by the law of large numbers. Therefore, by keeping the same number of snapshots M but reducing the size of each snapshot, U-RIDE can match the overhead of RIDE without sacrificing accuracy. Denote by V_R and V_U the original sample sets of RIDE and U-RIDE, respectively. Further, define ɛ_R and ɛ_U to be the corresponding subsampling factors. The following theorem ensures that U-RIDE with subsampling can be as efficient as RIDE.

Theorem 5: Assuming ɛ_R|V_R| = ɛ_U|V_U|, the overhead of U-RIDE is upper bounded by that of RIDE for all Δ, i.e., q_UR ≤ 1.

Proof: The statement follows from the fact that both RIDE and U-RIDE use residual sampling. Note that after discovering an alive user, residual sampling keeps probing the user until it dies or window T expires. Therefore, the overhead of U-RIDE is the same as that of RIDE as long as the number of residual samples discovered by each is the same, which is guaranteed by ɛ_R|V_R| = ɛ_U|V_U|. Note that U-RIDE might have several samples from the same user. It thus follows that U-RIDE might have lower overhead than RIDE given the condition ɛ_R|V_R| = ɛ_U|V_U|.

As network size n → ∞, one can always choose ɛ_U(n) ∝ 1/n such that ɛ_U(n)|V_U| remains constant at some predetermined threshold needed to invoke the law of large numbers. With this modification, U-RIDE retains the overhead advantages of RIDE compared to CBM and better scales to larger systems, as shown in Table I for small ɛ_U.

V. EXPERIMENTS

In this section, we compare U-RIDE with RIDE based on Gnutella measurements. In what follows, we first introduce our data collection process, then discuss comparison methodology, and finally present our results.

A. Dataset

11 Country Samples Percentage Unique IPs ISP Samples Percentage Unique IPs United States 12.8M 48.3% 21M FDC Servers 21.5M 8.6% 3.6M Brazil 35.7M 14.3% 6.4M Level 3 18.2M 7.3% 3M Canada 16M 6.4% 2.6M Tele. Santa Catarina 11.3M 4.5% 2.1M United Kindom 13.3M 5.3% 2M Tele. Bahia 8.7M 3.5% 1.5M Germany 6M 2.4% 1M SBC 8.2M 3.3% 1.3M Australia 5M 2%.93M Verizon 6.2M 2.5% 1M Japan 4.6M 1.9%.91M Tele. Sao Paulo 5.5M 2.2%.96M Netherlands 4.5M 1.8%.87M Shaw 4.8M 1.9% 9.3M Poland 4.4M 1.7%.82M Cablevision 4.1M 1.6%.76M Austria 4.3M 1.7%.7M Cox 4.M 1.6%.72M TABLE II NUMBER OF LIFETIME SAMPLES IN THE TOP-1 SUBSETS OF COUNTRY AND ISP. Proof: It is easy to prove the statement of this theorem by the fact that both RIDE and U-RIDE use residual sampling. Note that after discovering an alive user, residual sampling keep probing the user until it dies or window T expires. Therefore, the overhead of U-RIDE is the same as that of RIDE as long as the number of residual samples discovered by them are the same, which is guaranteed by ɛ R V R = ɛ U V U. Note that U-RIDE might have several samples from the same user. It thus follows that U-RIDE might have less overhead than RIDE given the condition ɛ R V R = ɛ U V U. As network size n, one can always choose ɛ U (n) 1/n such that ɛ U (n) V U remains constant at some predetermined threshold needed to invoke the law of large numbers. With this modification, U-RIDE retains the overhead advantages of RIDE compared to CBM and better scales to larger systems as shown in Table I for small ɛ U. V. EXPERIMENTS In this section, we compare U-RIDE with RIDE based on Gnutella measurements. In what follows, we first introduce our data collection process, then discuss comparison methodology, and finally present our results. A. Dataset Gnutella [6] is a popular peer-to-peer file sharing network that organizes users into a two-tier overlay structure. Each peer is identified by its (IP address, port) pair and can serve in one of two roles: ultrapeer or leaf. The former type of users connect to other ultrapeers to form the Gnutella overlay and route search messages between each other to find content. The latter type of users attach to a handful of ultrapeers and do not provide any routing services to other members of the system. Note that Gnutella has no central administration and its global structure at any given time is hidden from the user. Leveraging the crawl option supported in Gnutella/.6, our crawler requests neighbors of each visited ultrapeer and runs a BFS-like algorithm to capture snapshots of the entire system at different times t m. In a continuous experiment that lasted W = 7 days during June 14-2, 27, we performed repeated crawls of Gnutella every = 3 minutes, which approximated the behavior of CBM and provided enough data to emulate both U-RIDE and RIDE using offline processing. The dataset recorded over 25M user instances (36.9M ultrapeers and 1 F L (x) 1 1 1 1 2 Fig. 8. RIDE. powerfit: α = 1.152 β =.6979 CBM powerfit 1 3 1 1 1 1 1 1 2 lifetime x+β (hours) (a) CBM with a power fit 1 F L (x) 1 1 1 1 2 RIDE CBM 1 3 1 1 1 1 1 1 2 lifetime x+β (hours) (b) RIDE Estimated lifetime distribution of all observed peers using CBM and 219.1M leaves) from 5.5M unique IPs. Due to the dynamic nature of ports and IPs, we were unable to determine the total number of unique peers that participated in the system; however, the average number of concurrent users during this period has stayed close to 6.5M. 
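The crawl procedure described above amounts to a breadth-first traversal of the ultrapeer overlay; in the sketch below, get_neighbors is a hypothetical stand-in for the Gnutella/0.6 crawl request, not a real API.

from collections import deque

def crawl_snapshot(seed_ultrapeers, get_neighbors):
    """BFS over the ultrapeer overlay starting from a seed set.  get_neighbors
    is a placeholder that returns (ultrapeer_neighbors, leaf_neighbors) of a
    visited ultrapeer; every (ip, port) pair seen is recorded as alive in
    this snapshot."""
    seen = set(seed_ultrapeers)
    queue = deque(seed_ultrapeers)
    alive = set(seed_ultrapeers)
    while queue:
        peer = queue.popleft()
        ultrapeers, leaves = get_neighbors(peer)   # placeholder network call
        alive.update(ultrapeers)
        alive.update(leaves)
        for u in ultrapeers:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return alive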
We also split the dataset based on two criteria: geographic location and service provider. Table II lists the numbers of samples and their percentages, along with unique IPs, of the top-10 subsets in both categories. We observe from the table that while the collected samples concentrate in a few countries, with almost 50% from the US, the distribution of users among service providers is much more even, with all ISPs receiving less than 10% of the samples.

B. Comparison Methodology

To compare U-RIDE with RIDE, we first need to obtain F_L(x) as ground truth. While this task is impossible with absolute accuracy, our earlier results (see Corollary 1) have shown that CBM has a diminishing bias under NS-PCM as Δ → 0. In particular, this condition can often be assumed to hold when Δ ≪ E[L] (simulations omitted for brevity), which is satisfied in our crawls given E[L] ≈ 2 hours. We processed the dataset with all observed peers using CBM after discarding 3.4M invalid samples, but RIDE uses the original dataset. Fig. 8(a) plots the resulting distribution on a log-log scale along with a power-law fit, which indicates that lifetimes of Gnutella users follow a power-law distribution with shape α = 1.15 and β = 0.69, which is consistent with the result in [2] and other prior papers. With the data collected from CBM sampling, we are now ready to compare the other two methods.
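The power-law fit of Fig. 8(a) can be reproduced in spirit with a simple least-squares regression on the log-log CCDF; the β grid and the procedure below are our own illustrative choices, not the authors' fitting method.

import math

def fit_shifted_pareto(lifetimes, betas=(0.3, 0.5, 0.7, 1.0, 1.5)):
    """Fit 1 - F_L(x) = (1 + x/beta)^(-alpha) by least squares on the log-log
    CCDF over a small grid of beta values; returns (alpha, beta)."""
    xs = sorted(lifetimes)
    n = float(len(xs))
    best = None
    for beta in betas:
        # empirical CCDF point for the i-th smallest sample is 1 - i/n
        pts = [(math.log(x + beta), math.log(1.0 - i / n)) for i, x in enumerate(xs)]
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        sxx = sum((p[0] - mx) ** 2 for p in pts)
        slope = sum((p[0] - mx) * (p[1] - my) for p in pts) / sxx
        err = sum((p[1] - (my + slope * (p[0] - mx))) ** 2 for p in pts)
        if best is None or err < best[0]:
            best = (err, -slope, beta)      # alpha is the negative log-log slope
    return best[1], best[2]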