Real-time Classification of Large Data Sets using Binary Knapsack

Renato Bruni, bruni@dis.uniroma1.it
University of Roma "La Sapienza"
AIRO 2004 - 35th Annual Conference of the Italian Operations Research Society, September 7-10, 2004, Lecce, Italy

Outline
The Data Classification Problem
Binarization of Data Records
Evaluation of Alternative Binarizations
Selection of Binary Attributes
Computational Results

Data Classification
Finding models that describe and distinguish classes or concepts for future prediction.
More specifically: given a training set S of data already partitioned into classes $S^+$ and $S^-$, predict which class each new data item belongs to.
This is a supervised learning problem within the field of data mining: the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
We deal with massive data sets with real-time requirements.

Data Records
Data are structured into records.
A record schema is a set of fields $R = \{f_1, \dots, f_m\}$.
A record instance is a set of values $r = \{v_1, \dots, v_m\}$.
Each field $f_i$ has its domain $D_i$: the set of every possible value $v_i$, including errors.
Example: for records representing persons, fields can be age and marital status; corresponding values can be 18 and single; corresponding domains can be $\mathbb{Z} \cup \{\text{blank}\}$ and {single, married, separate, divorced, widow, blank}.

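Fields and Attributes
Fields can be:
numerical (or quantitative) or categorical (or qualitative)
continuous (real-valued) or discrete (integer or binary)
ordered or not ordered
Generally, classification procedures require a conversion of all fields $f_i$ into binary ones. These will here be called attributes $a_i^k \in \{0,1\}$:
\[ f_i \rightarrow \{a_i^1, \dots, a_i^{n_i}\}, \qquad R_b = \{a_1^1, \dots, a_1^{n_1}, \dots, a_m^1, \dots, a_m^{n_m}\} \]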

Binarization
A basic and commonly used binarization procedure is the derivation of cut-points (e.g. LAD [Boros-Hammer-Ibaraki-Kogan]): given $r^+ \in S^+$ and $r^- \in S^-$ such that their values on field $f_i$ are separated by no other record, derive a cut-point
\[ \alpha = \frac{v_i^{r^+} + v_i^{r^-}}{2} \]
Cut-points are used to generate attributes: above or below $\alpha$.
Many cut-points are obtainable. For each group of them, we have a binarization.
Many alternative binarizations are possible.
Selecting the best one is a combinatorial optimization problem.

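Example
Classification of basketball players and non basketball players

           S+                   S-
Weight     90     100    75     105    70
Height     195    205    180    190    175
Class      yes    yes    yes    no     no

Values sorted by field, with the class of each record and the resulting cut-points:

Weight     70     75     90     100    105
Class      -      +      +      +      -
Cut-points 72.5, 102.5

Height     175    180    190    195    205
Class      -      +      -      +      +
Cut-points 177.5, 185, 192.5

Possible binarizations:
1. using all cut-points (bad)
2. 72.5, 102.5 and 177.5
3. 72.5 and 102.5
4. ...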
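The cut-point rule can be sketched in Python as follows. This is a minimal illustration, not the author's implementation (the slides state that the algorithm was implemented in C). The function names `cut_points` and `binarize` are hypothetical, and the data assume that the example table reads weights 90, 100, 75, 105, 70 and heights 195, 205, 180, 190, 175.

```python
def cut_points(values, labels):
    """Midpoints between consecutive distinct values whose records differ in class."""
    # Collect the set of classes observed at each distinct field value.
    classes_at = {}
    for v, c in zip(values, labels):
        classes_at.setdefault(v, set()).add(c)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        # Keep the midpoint only if some record at lo and some record at hi
        # belong to different classes.
        if classes_at[lo] != classes_at[hi] or len(classes_at[lo]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


def binarize(value, cuts):
    """One binary attribute per cut-point: 1 if the value lies above it, else 0."""
    return [1 if value > alpha else 0 for alpha in cuts]


labels = ["+", "+", "+", "-", "-"]
weights = [90, 100, 75, 105, 70]
heights = [195, 205, 180, 190, 175]

print(cut_points(weights, labels))   # [72.5, 102.5]
print(cut_points(heights, labels))   # [177.5, 185.0, 192.5]
print(binarize(90, [72.5, 102.5]))   # [1, 0]
```

Each selected cut-point yields one binary attribute, so choosing a subset of cut-points is the same as choosing a binarization.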

Evaluation of Cut-points
We investigate a criterion for evaluating the quality of a binarization with a fast procedure.
We evaluate each single cut-point using the training set.
There can be different situations:
[Figure: three cases showing the distribution of + and the distribution of - along a field: (a) with cut-points $\alpha_1^a, \dots, \alpha_6^a$; (b) with cut-points $\alpha_1^b, \dots, \alpha_5^b$; (c) with cut-point $\alpha_1^c$.]

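Probability Issues
The odds of giving a correct positive [negative] classification using $\alpha$ take values in $[0, +\infty)$. The quality of $\alpha$ is
\[ q_\alpha = \ln\left( \frac{\Pr(+ \cap \text{class}^+(\alpha))}{\Pr(- \cap \text{class}^+(\alpha))} \cdot \frac{\Pr(- \cap \text{class}^-(\alpha))}{\Pr(+ \cap \text{class}^-(\alpha))} \right) \]
We want an evaluation of each cut-point that can be summed.
The probability of a conjunction of events is a product.
Therefore, we consider the logarithm, obtaining a sum.
Let $N^+$, $N^-$ be the real classification (unknown) and $A^+$, $A^-$ be the supposed classification (known).
By definition of probability,
\[ \Pr(+ \cap \text{class}^+(\alpha)) = \frac{|N^+ \cap A^+|}{|N|} \]
Therefore,
\[ q_\alpha = \ln\left( \frac{|N^+ \cap A^+|}{|N^- \cap A^+|} \cdot \frac{|N^- \cap A^-|}{|N^+ \cap A^-|} \right) \]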

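Fields with Normal Distribution
We do not know $N^+$, $N^-$, but for fields with a Normal (Gaussian) distribution we guess where they are by using $S^+$ and $S^-$ as follows.
We hypothesize this for all continuous fields and for discrete fields with a large number of values (the hypothesis is testable).
We compute mean value $m^+$ ($m^-$) and deviation $\sigma^+$ ($\sigma^-$) from $S^+$ ($S^-$), and for a transition from - to + we have:
\[ q_\alpha = \ln\left( \frac{\int_{\alpha}^{+\infty} \frac{1}{\sigma^+\sqrt{2\pi}}\, e^{-\frac{(t-m^+)^2}{2(\sigma^+)^2}}\, dt}{\int_{\alpha}^{+\infty} \frac{1}{\sigma^-\sqrt{2\pi}}\, e^{-\frac{(t-m^-)^2}{2(\sigma^-)^2}}\, dt} \cdot \frac{\int_{-\infty}^{\alpha} \frac{1}{\sigma^-\sqrt{2\pi}}\, e^{-\frac{(t-m^-)^2}{2(\sigma^-)^2}}\, dt}{\int_{-\infty}^{\alpha} \frac{1}{\sigma^+\sqrt{2\pi}}\, e^{-\frac{(t-m^+)^2}{2(\sigma^+)^2}}\, dt} \right) \]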
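As a rough illustration of the Normal case, the four integrals can be read as Gaussian tail probabilities and evaluated with the standard library. This sketch assumes that reading, assumes a transition from - to + (values above $\alpha$ are classified +), and uses the sample standard deviation; the function name `normal_quality` is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_quality(pos_values, neg_values, alpha):
    """Quality of cut-point alpha when each class is modelled as a Gaussian."""
    # Fit one Gaussian per class from the training values (assumption: sample stdev).
    pos = NormalDist(mean(pos_values), stdev(pos_values))
    neg = NormalDist(mean(neg_values), stdev(neg_values))
    # Tail probabilities above and below the cut-point for each class.
    pos_above, pos_below = 1 - pos.cdf(alpha), pos.cdf(alpha)
    neg_above, neg_below = 1 - neg.cdf(alpha), neg.cdf(alpha)
    # Log of the product of the two odds, as in the formula for q.
    return math.log((pos_above / neg_above) * (neg_below / pos_below))


# Hypothetical training values for one field.
q = normal_quality([4, 5, 6, 7, 8], [0, 1, 2, 3, 4], alpha=4)
print(q > 0)  # True: the cut-point separates the two fitted distributions
```

For cut-points far in the tails the ratios involve very small probabilities, so a production version would need to guard against division by zero.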

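Fields with Binomial Distribution
We do not know $N^+$, $N^-$, but for fields with a Binomial (Bernoulli) distribution we guess where they are by using $S^+$ and $S^-$ as follows.
We hypothesize this for all discrete fields and ordered categorical fields (the hypothesis is testable).
We compute number of values $n^+$ ($n^-$) and probability of success $p^+$ ($p^-$) from $S^+$ ($S^-$), and for a transition from - to + we have:
\[ q_\alpha = \ln\left( \frac{\sum_{i=\alpha}^{n^+} \frac{n^+!}{i!\,(n^+-i)!}\, (p^+)^i (1-p^+)^{n^+-i}}{\sum_{i=\alpha}^{n^-} \frac{n^-!}{i!\,(n^--i)!}\, (p^-)^i (1-p^-)^{n^--i}} \cdot \frac{\sum_{i=0}^{\alpha-1} \frac{n^-!}{i!\,(n^--i)!}\, (p^-)^i (1-p^-)^{n^--i}}{\sum_{i=0}^{\alpha-1} \frac{n^+!}{i!\,(n^+-i)!}\, (p^+)^i (1-p^+)^{n^+-i}} \right) \]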
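The Binomial case can be sketched in the same way, replacing the integrals with sums of the binomial mass function. This is a minimal sketch under assumptions: the sums are taken to run from $\alpha$ upward and from 0 to $\alpha - 1$, the transition is from - to +, and the names `binomial_tail` and `binomial_quality` are hypothetical.

```python
import math


def binomial_tail(n, p, start, stop):
    """Sum of the Binomial(n, p) mass function over start <= i <= stop."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(max(start, 0), min(stop, n) + 1)
    )


def binomial_quality(n_pos, p_pos, n_neg, p_neg, alpha):
    """Quality of an integer cut-point alpha under a Binomial model per class."""
    pos_above = binomial_tail(n_pos, p_pos, alpha, n_pos)
    neg_above = binomial_tail(n_neg, p_neg, alpha, n_neg)
    neg_below = binomial_tail(n_neg, p_neg, 0, alpha - 1)
    pos_below = binomial_tail(n_pos, p_pos, 0, alpha - 1)
    # Log of the product of the two odds, mirroring the Normal case.
    return math.log((pos_above / neg_above) * (neg_below / pos_below))


# Hypothetical parameters: 4 trials per class, success probability 0.75 vs 0.25.
print(round(binomial_quality(4, 0.75, 4, 0.25, alpha=2), 3))  # 3.965
```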

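Fields with Unknown Distribution
For fields having unknown distribution (categorical fields, or fields where the other hypotheses are not verified), we simply replace $N^+$ ($N^-$) with $S^+$ ($S^-$), and compute the quality as follows:
\[ q_\alpha = \ln\left( \frac{|S^+ \cap A^+|}{|S^- \cap A^+|} \cdot \frac{|S^- \cap A^-|}{|S^+ \cap A^-|} \right) \]
When no distribution hypothesis can be made, we are in fact unable to guess where the positive and negative points not in the training set should be.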
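The count-based quality can be computed directly from the training set. The sketch below assumes a transition from - to + (values above $\alpha$ form $A^+$); the function name `count_quality` and the treatment of zero counts (returning infinite or zero quality) are assumptions made for illustration, since the slides do not say how empty intersections are handled.

```python
import math


def count_quality(pos_values, neg_values, alpha):
    """Log-odds quality of cut-point alpha from training-set counts."""
    tp = sum(v > alpha for v in pos_values)   # |S+ ∩ A+|
    fn = len(pos_values) - tp                 # |S+ ∩ A-|
    fp = sum(v > alpha for v in neg_values)   # |S- ∩ A+|
    tn = len(neg_values) - fp                 # |S- ∩ A-|
    num, den = tp * tn, fp * fn
    # Assumed conventions for degenerate counts (not specified in the slides).
    if den == 0:
        return math.inf if num > 0 else 0.0
    if num == 0:
        return -math.inf
    return math.log(num / den)


# Hypothetical training values for one field.
q = count_quality([1, 2, 6, 7, 8], [0, 1, 2, 3, 7], alpha=4.5)
print(round(q, 4))  # 1.7918, i.e. ln((3/1) * (4/2)) = ln 6
```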

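The Knapsack Model
Now we need to choose the best binarization.
We associate to cut-points binary variables: $x_j = 1$ if $\alpha_j$ is used, $x_j = 0$ otherwise.
In nearly real-time applications, we can compute the number b of attributes we can deal with in reasonable time.
\[ \max \sum_j q_j x_j \quad \text{s.t.} \quad \sum_j p_j x_j \le b, \quad x_j \in \{0,1\} \]
Here $q_j$ is the quality of cut-point $\alpha_j$ and $p_j$ its weight.
If all $p_j$ are 1, this knapsack becomes an easy problem: a greedy heuristic finds the optimal solution.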
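For the unit-weight case, the greedy selection amounts to keeping the b cut-points of highest quality. A minimal sketch follows; the function name `select_cut_points` is hypothetical, and discarding cut-points with non-positive quality is an assumption that follows from the constraint being an upper bound on the number of selected items.

```python
def select_cut_points(qualities, b):
    """Indices of at most b cut-points maximising total quality (all weights 1)."""
    # Rank cut-point indices by decreasing quality.
    ranked = sorted(range(len(qualities)), key=lambda j: qualities[j], reverse=True)
    # Keep the best b, skipping any that would not increase the objective.
    return sorted(j for j in ranked[:b] if qualities[j] > 0)


# Hypothetical qualities for five cut-points.
q = [1.2, 0.3, 2.5, -0.4, 0.9]
print(select_cut_points(q, 3))   # [0, 2, 4]
print(select_cut_points(q, 10))  # [0, 1, 2, 4]
```

Because b can be fixed from the time budget before any optimisation is run, the selection cost is dominated by one sort of the cut-point qualities.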

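Classification
Once data are binarized, the actual classification step is performed using the following weighted sum, where
P is the set of attributes giving a positive classification,
N is the set of attributes giving a negative classification:
\[ \sum_{i \in P} w_i\, a_i(r) + \sum_{i \in N} w_i\, a_i(r) \quad \begin{cases} \ge 0 & r \text{ is classified } + \\ < 0 & r \text{ is classified } - \end{cases} \]
Weights $w_i$ for the attributes are positive [negative] values proportional to the cardinality of the part of $S^+$ [$S^-$] contained in such attributes.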
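The decision rule itself is a signed weighted sum over the binary attributes of a record. The sketch below assumes the weights are already given (positive for attributes in P, negative for attributes in N); the function name `classify` and the numeric weights are hypothetical.

```python
def classify(attributes, weights):
    """Classify a binarized record by the sign of its weighted attribute sum."""
    # attributes: 0/1 values of the record; weights: signed weight per attribute.
    score = sum(w * a for w, a in zip(weights, attributes))
    return "+" if score >= 0 else "-"


# Hypothetical weights: attributes 0 and 2 vote positive, attribute 1 votes negative.
weights = [3, -4, 1]
print(classify([1, 0, 1], weights))  # +
print(classify([0, 1, 0], weights))  # -
```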

Computational Results
The algorithm is implemented in C and tested on the largest datasets with binary classification of the UCI repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
(spam e-mail, adult, german credit, musk, pima indians)
Tests using 0%, 5% and 30% of the dataset as training set.
Each result is the average of 5 trials with random selection of the training set.

Spam E-mail Dataset
Classify email as spam or not: 4601 records, each having 58 numerical fields (55 real, 2 discrete)

Training set     0%       5%       30%
Accuracy (%)     96.74    97.3     96.53
Time (sec.)      0.78     0.8      0.85

Best in literature: comparable (97-98%), with a much larger training set (50%) and more time (~0x)

Adult Dataset
Decide whether annual income > 50,000 $: 45222 records (records with missing values removed), each having 15 fields (6 real, 8 categorical)

Training set     0%       5%       30%
Accuracy (%)     76.73    75.96    76.78
Time (sec.)      3.4      3.53     3.7

Best in literature: moderately better (85-86%), with a much larger training set (75%) and more time (~0x)

Statlog - German Credit Dataset
Classify good and bad creditors: 1000 records, each having 20 fields (7 numerical, 13 categorical)

Training set     0%       5%       30%
Accuracy (%)     63.88    68.89    65.7
Time (sec.)      0.7      0.7      0.7

Best in literature: moderately better (75%), with a much larger training set (60%) and much more time (~50x)

Musk clean2 Dataset
Classify molecules as musk or not: 6598 records, each having 167 real fields

Training set     0%       5%       30%
Accuracy (%)     86.00    86.47    88.94
Time (sec.)      .75      .76      .96

Best in literature: slightly better (9%), with a much larger training set (50%) and more time (~0x)

Pima Indians Dataset
Classify patients as diabetic or not: 768 records, each having 8 numerical fields (2 real, 6 discrete)

Training set     0%       5%       30%
Accuracy (%)     65.9     65.9     66.30
Time (sec.)      0.05     0.03     0.04

Best in literature: moderately better (70-75%), with a much larger training set (50%) and much more time (~00x)

Conclusions
The proposed approach classifies large datasets in extremely short times, obtaining a good accuracy and using very reduced training sets.
Computational time may be decided in advance, hence the procedure is suitable for dealing with real-time requirements.
Accuracy increases with the dimension of the training set until a certain dimension.
When further increasing the training set, and hence the granularity of the binarization, logic combinations of attributes become useful (e.g. LAD).