ScienceDirect. A SVM Stock Selection Model within PCA

Available online at www.sciencedirect.com
Procedia Computer Science 31 (2014) 406-412
2nd International Conference on Information Technology and Quantitative Management, ITQM 2014

A SVM Stock Selection Model within PCA

Huanhuan Yu a, Rongda Chen b,*, Guoping Zhang c

a School of Finance, Zhejiang University of Finance & Economics, Hangzhou, 310018, China
b School of Finance, Zhejiang University of Finance & Economics, Hangzhou, 310018, China
c School of Economics and International Trade, Zhejiang University of Finance & Economics, Hangzhou, 310018, China

* Corresponding author. Tel.: +860571-85750010; fax: +860571-85212001. E-mail address: rogdache@163.com.
1877-0509 © 2014 Published by Elsevier B.V. doi: 10.1016/j.procs.2014.05.284

Abstract

In the financial market, well-performing stocks usually show some specific features in their financial figures. This paper introduces a machine learning method, the support vector machine (SVM), to construct a stock selection model that can perform nonlinear classification of stocks. However, the accuracy of SVM classification is very sensitive to the quality of the training set. To avoid the direct use of complicated and highly dimensional financial ratios, we bring principal component analysis (PCA) into the SVM model to extract low-dimensional and efficient feature information, which improves training accuracy and efficiency while preserving the features of the initial data. As the empirical results show, the SVM-based stock selection model with PCA after norm-standardization achieves an overall accuracy of 75.4464% on the training set and 61.7925% on the test set. Further, the annual earnings of the stock portfolio selected by the PCA-SVM model significantly outperform those of the A-share index of the Shanghai Stock Exchange.

© 2014 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Organizing Committee of ITQM 2014.

Keywords: machine learning; stock selection; principal components analysis; support vector machine

1. Introduction

Stocks have always been one of the most popular investment instruments in financial markets. Investors and researchers have long sought a method that accurately selects stocks with favorable future returns as constituents of an investment portfolio. Guo and Zhang [1], Kuo et al. [2] and Tsumato et al. [3] developed several methods to forecast stock prices or to pick qualified stocks from a large sample. However, traditional stock selection models usually face challenges when dealing with highly dimensional and nonlinear sample data, because stock selection is a decision with multiple objectives and multiple restrictions that is made on huge, highly dimensional financial data. The machine learning-based Artificial Neural Network (ANN) can capture the regular patterns hidden behind complex, high-dimensional data through its learning process [4,5]. Although ANN performs better than traditional methods, it has many defects, such as the difficulty of determining the network structure, the problem of local minimum points and over-fitting. Vapnik [6] proposed a machine learning method called Support Vector Machine (SVM), which handles high-dimensional data better and avoids these defects of ANN. SVM is applied widely in many fields because of its particular advantages. Many studies, domestic and abroad, use SVM to predict stock prices or reversal points, such as Yeh et al. [7] and Huang [8]. SVM is seldom used to establish a stock selection model, however, and especially rarely in the domestic market.

This paper applies SVM to the domestic stock market to establish an effective selection model. We treat the financial ratios of listed companies in the A-share market of the Shanghai Stock Exchange as original data and use principal components analysis (PCA) to preprocess them. First, we establish a stock selection model (PCA-SVM) that recognizes high-return stocks by training an SVM on the training set. Second, we apply PCA-SVM to the test set to forecast the high-return stocks of the next year and compare the forecast with the actual outcome to illustrate the effectiveness of the established stock selection model.

2. Principal components analysis (PCA)

The financial ratios of a listed company cover earning ability, growth ability, solvency ability and so on. Each ability contains many sub-ratios.
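If all the ratios were used as inputs in the training set, the result would be redundancy and low efficiency, and the quality of the empirical results could even decrease. New variables can be created through a transformation of the original variables, so that the number of variables is smaller while most of the information is still retained. These new variables are called principal components.

2.1. Definition of principal components

Principal components can be expressed as follows:

$$
\begin{aligned}
Y_1 &= \alpha_1^T X = \alpha_{11}X_1 + \alpha_{12}X_2 + \cdots + \alpha_{1n}X_n \\
Y_2 &= \alpha_2^T X = \alpha_{21}X_1 + \alpha_{22}X_2 + \cdots + \alpha_{2n}X_n \\
&\;\;\vdots \\
Y_n &= \alpha_n^T X = \alpha_{n1}X_1 + \alpha_{n2}X_2 + \cdots + \alpha_{nn}X_n
\end{aligned}
\tag{1}
$$

where $X = (X_1, X_2, \ldots, X_n)^T$ is the vector of original variables, $Y_i$ is the $i$-th principal component and $\alpha_i$ is its coefficient vector. $\alpha_i$ can be estimated by maximizing $\mathrm{Var}(Y_i) = \alpha_i^T \Sigma \alpha_i$ subject to the constraints $\alpha_i^T \alpha_i = 1$ and $\mathrm{Cov}(Y_i, Y_j) = \alpha_i^T \Sigma \alpha_j = 0$, $j = 1, 2, \ldots, i-1$, where $\Sigma = (\sigma_{ij})$ is the covariance matrix of $X$.

2.2. Selection of principal components

The covariance matrix of $X = (X_1, X_2, \ldots, X_n)^T$, $\Sigma = (\sigma_{ij})$, is a symmetric non-negative definite matrix. Therefore it has $n$ characteristic roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ and $n$ characteristic vectors. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$ and the orthogonal unit eigenvectors are $e_1, e_2, \ldots, e_n$. The $i$-th principal component of $X_1, X_2, \ldots, X_n$ can be expressed as follows: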

$$
Y_i = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{in}X_n, \quad i = 1, 2, \ldots, n \tag{2}
$$

with $\mathrm{Var}(Y_i) = e_i^T \Sigma e_i = \lambda_i$ and $\mathrm{Cov}(Y_i, Y_j) = e_i^T \Sigma e_j = 0$, $i \ne j$. The accumulated contribution rate of the first $p$ principal components is

$$
\mathrm{ACR}(p) = \sum_{i=1}^{p} \lambda_i \Big/ \sum_{i=1}^{n} \lambda_i \tag{3}
$$

which represents how much of the original data is explained by the principal components extracted by the PCA method. Generally, an ACR of at least 85% is required; otherwise the PCA method is considered unsuitable because it loses too much of the original information. Since the covariance matrix is sensitive to the order of magnitude of the data, we need to standardize the data first. Two methods of standardization are in common use:

Norm-standardization: $X_{ij}^{*} = (X_{ij} - \bar{X}_j)/s_j$, where $\bar{X}_j$ is the mean and $s_j$ is the standard deviation.
Mean-standardization: $X_{ij}^{*} = X_{ij}/\bar{X}_j$, where $\bar{X}_j$ is the mean.

3. Support vector machine

3.1. Linear classification of SVM

Linear classification with SVM is realized by solving for the optimal separating hyper-plane when the training set is linearly separable. If the two mingled classes $(C_1, C_2)$ of a sample can be separated correctly by a linear function $(H_0)$ in a two-dimensional plane, the sample is treated as linearly separable. Suppose the training set is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the sample information vector ($x_i$ is the coordinate vector in a two-dimensional plane), $y_i \in Y = \{1, -1\}$, $+1$ represents class $C_1$ and $-1$ represents class $C_2$. The linear separating hyper-plane $H_0: w \cdot x + b = 0$ separates the training set correctly when the following holds: if $y_i = 1$, then $w \cdot x_i + b \ge 1$; if $y_i = -1$, then $w \cdot x_i + b \le -1$. If the distance $D$ between the two data clusters of the sample is maximized, this hyper-plane is called the optimal separating hyper-plane for the classification case. Define

$$
D = d_+ + d_-, \qquad d_\pm = \min_{i:\, y_i = \pm 1} \frac{|w \cdot x_i + b|}{\lVert w \rVert} \tag{4}
$$

By substituting $w \cdot x + b = \pm 1$ into (4), we obtain $D = d_+ + d_- = 2/\lVert w \rVert$, so the problem becomes finding the $w$ that minimizes $\lVert w \rVert$. ($b$ can be calculated by substituting sample points once $w$ is known.) Additionally, to avoid the situation in which the distance between the two parallel hyper-planes is maximized while effective classification is not realized, we must impose constraints on this optimization problem as follows:

$$
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \tag{5}
$$

$\xi_i$ is the slack variable that tolerates outliers. A penalty factor $C$ is also introduced into the objective function to reflect the losses incurred by tolerating the outliers. Training an SVM model, i.e. solving the optimization problem, leads to a quadratic programming problem, as shown in (6).
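The text describes the objective that (5) constrains only in words. As a hedged reading, assuming the standard soft-margin formulation (not quoted from the source) and using the same symbols $w$, $b$, $\xi_i$ and $C$, the problem can be written explicitly as:

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n .
```

Minimizing $\frac{1}{2}\lVert w\rVert^{2}$ is equivalent to minimizing $\lVert w\rVert$. Introducing a Lagrange multiplier $\alpha_i$ for each margin constraint and eliminating $w$, $b$ and $\xi$ gives the dual quadratic program (6); the stationarity condition in $\xi_i$ is what produces the box constraint $0 \le \alpha_i \le C$.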

$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
\tag{6}
$$

Suppose $\alpha^{*}$ is the solution of (6). The optimal hyper-plane is then $w^{*} \cdot x + b^{*} = 0$, where $w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i x_i$ and $b^{*}$ can be calculated from the constraints of (5).

3.2. Nonlinear classification of SVM

The linear classification discussed in the prior section can only be applied when the sample is linearly separable. In this section, a nonlinear SVM method is used to handle the complicated and high-dimensional financial ratios. A kernel function is very important here because it corresponds to a map of the original data into a high-dimensional space $H$, i.e. $\phi: R^n \to H;\ x \mapsto \phi(x)$, in which the data can become linearly separable. The optimal separating hyper-plane discussed in the prior section can then be obtained in $H$ to do the classification. Suppose the training set is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the highly dimensional information vector of the sample and $y_i \in Y = \{1, -1\}$. A quadratic programming problem similar to (6) is obtained through the mapping $\phi$:

$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
\tag{7}
$$

Solving (7) directly would require knowing $\phi: R^n \to H;\ x \mapsto \phi(x)$, so we choose the Gauss radial basis kernel function (RBF) to obtain the inner product value $k(x, y) = \langle \phi(x), \phi(y) \rangle$ directly, without searching for the complex $\phi$.

4. Data selection

This paper selects 7 categories of financial ratios of companies in the A-share market of the Shanghai Stock Exchange from their annual reports of 2009 and 2010. The detailed financial indexes chosen are shown in Table 1. Our objective is to separate the high-return stocks from the low-return ones according to the features hidden in the financial ratios, so it is necessary to label each stock with its return characteristic. After statistical analysis, all the companies had announced their annual reports before 1 May in 2009 and 2010. We therefore label a stock as +1 (i.e. $y_i = 1$) if its return ranks in the first 25% of all the sample stocks, and as $y_i = -1$ otherwise. Labels for a part of the sample are presented in Table 2.

Table 1. Financial ratios and sample stocks information

Sample stocks: 2009, 677 stocks; 2010, 679 stocks

Category                  Financial ratios (symbol)
Earnings ability (A)      EBIT (a1); ROA (a2); ROE (a3)
Activity ratio (B)        Turnover of accounts receivable (b1); Turnover of inventory (b2); Turnover of current assets (b3)
Shareholder return (C)    EPS (c1); Price-to-book ratio (c2); Common stock profitability (c3); P/CF (c4)
Cash ratios (D)           EBIT-to-Cash ratio (d1); Cash-to-Assets ratio (d2); Operating ratio (d3)
Growth ratios (E)         Growth of total assets (e)
Risk level (F)            Financial leverage (f1); Operating leverage (f2)
Solvency ratios (G)       Quick ratio (g1); Debt-to-Asset ratio (g2); EBIT/Interest ratio (g3); EBIT/Fixed charge ratio (g4)

5. Stock selection model and analysis

5.1. Extraction of training set based on PCA method

The financial ratios of 677 stocks in 2009 are the original data. We apply PCA to extract the principal components satisfying the condition ACR ≥ 85%. Since our sample is large, applying PCA to all of the ratios of the 677 stocks directly would lose local information, and the effect of dimension reduction would also be smaller. Thus we run one PCA extraction for every 40 sample stocks. The training set is shown in Table 2.

Table 2. Training set of SVM nonlinear classification (part of 677 stocks)

PCA with norm-standardization
Stock code  Earnings ability  Activity ratios  Shareholder return  Cash ratios  Growth ratios  Risk levels  Solvency ratios   y
600069          -1.6114          -0.9830           -0.4337           -1.0664       -0.4253        0.7874        0.1431        1
600070           0.5249          -0.3005           -0.8563           -0.5438       -0.0903       -0.1103        0.0136       -1
600071           2.1843           0.1875           -1.5191            1.1364       -0.6570       -1.7170        0.7624        1

PCA with mean-standardization
Stock code  Earnings ability  Activity ratios  Shareholder return  Cash ratios  Growth ratios  Risk levels  Solvency ratios   y
600069           0.8222          -1.3006            0.8049            1.0620       -0.9571        0.3681        1.8768        1
600070           4.6133           1.0647           -0.3712           -1.1497        0.8309        1.6046        1.5020       -1
600071           7.0948           1.1286           -0.7982            0.2286        0.2485       -0.2133        2.0515        1

5.2. SVM stock selection model and analysis

The total scores obtained in the prior section, combined with the return labels of the sample stocks, constitute the complete training set of the SVM. By applying the nonlinear SVM classification introduced in Section 3 to the training set, we obtain the optimal separating hyper-plane. Applying this hyper-plane to the test set classifies the test stocks into a high-return part and a low-return part, which can be seen as a prediction of the stocks' future return characteristics.
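A minimal sketch of how a training set like the one described above could be assembled, assuming NumPy and a plain stocks-by-ratios matrix. It applies norm-standardization, keeps the leading principal components until ACR(p) reaches 85%, and labels the top 25% of stocks by return as +1. The function names, the rounding rule for the 25% cut-off and the toy data are hypothetical, and the sketch does not reproduce the paper's batching of 40 stocks per PCA run or its Matlab workflow.

```python
import numpy as np


def norm_standardize(X):
    """Norm-standardization: (X_ij - mean_j) / s_j, column by column.

    Assumes no column is constant (s_j > 0).
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca_scores(X, min_acr=0.85):
    """Project standardized data on the first p components with ACR(p) >= min_acr.

    Returns (scores, acr) where acr[k] is the accumulated contribution
    rate of the first k + 1 components.
    """
    Z = norm_standardize(X)
    cov = np.cov(Z, rowvar=False)            # covariance matrix of the ratios
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    acr = np.cumsum(eigvals) / eigvals.sum()
    # Smallest p whose accumulated contribution rate reaches the threshold.
    p = min(int(np.searchsorted(acr, min_acr)) + 1, len(eigvals))
    return Z @ eigvecs[:, :p], acr[:p]


def label_top_fraction(returns, top_fraction=0.25):
    """Label the best-returning stocks +1 and all others -1.

    Rounding the cut-off up with ceil is an assumption of this sketch.
    """
    returns = np.asarray(returns, dtype=float)
    n_top = int(np.ceil(top_fraction * returns.size))
    y = -np.ones(returns.size, dtype=int)
    y[np.argsort(-returns)[:n_top]] = 1
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ratios = rng.normal(size=(40, 6))        # toy data: 40 stocks, 6 ratios
    returns = rng.normal(size=40)            # toy annual returns
    scores, acr = pca_scores(ratios)
    labels = label_top_fraction(returns)
    print(scores.shape, acr[-1], int((labels == 1).sum()))
```

The resulting `scores` and `labels` play the role of the feature columns and the `y` column of Table 2. Using `eigh` rather than a general eigen-solver relies on the covariance matrix being symmetric, as stated in Section 2.2.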
The accuracy of classification and prediction is presented in Table 3.

Table 3. Accuracy of SVM nonlinear classification

Set        Measure           Mean-standardization PCA-SVM   Norm-standardization PCA-SVM
Training   Whole accuracy    88.6905%                       75.4464%
Training   Accuracy of +1    100%                           58.5366%
Training   Accuracy of -1    85.0394%                       80.9055%
Test       Whole accuracy    69.1943%                       61.7925%
Test       Accuracy of +1    10.1266%                       24.5283%
Test       Accuracy of -1    88.8421%                       74.2138%

Training and testing of the SVM were carried out with Libsvm 3.1 in Matlab. To achieve the best generalization ability, the optimal penalty factor C and the RBF kernel coefficient are determined by the Grid Searching method.
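The three measures reported in Table 3 can be illustrated with a short sketch in plain Python. It assumes that "accuracy of +1" means the share of stocks whose true label is +1 that the model also classifies as +1, and likewise for -1; the paper does not spell this definition out. The function name and the toy labels are hypothetical.

```python
def class_accuracies(y_true, y_pred):
    """Return (whole accuracy, accuracy of +1, accuracy of -1) as fractions.

    Assumption: per-class accuracy is measured over the stocks whose
    true label is that class.
    """
    pairs = list(zip(y_true, y_pred))

    def share_correct(subset):
        # Fraction of (true, predicted) pairs that agree; NaN for an empty class.
        if not subset:
            return float("nan")
        return sum(t == p for t, p in subset) / len(subset)

    whole = share_correct(pairs)
    acc_pos = share_correct([(t, p) for t, p in pairs if t == 1])
    acc_neg = share_correct([(t, p) for t, p in pairs if t == -1])
    return whole, acc_pos, acc_neg


if __name__ == "__main__":
    y_true = [1, 1, -1, -1, -1, -1]   # toy labels, not data from the paper
    y_pred = [1, -1, -1, -1, -1, 1]
    print(class_accuracies(y_true, y_pred))
```

Because +1 stocks are only about a quarter of the sample, the whole accuracy is dominated by the -1 class, so a model can score well overall while identifying few high-return stocks. Reporting the per-class figures separately makes that visible.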

Table 3 shows that the accuracy of mean-standardization PCA-SVM for label +1 in the training set is 100%, whereas the accuracy for the same label in the test set is only 10.1266%. This is the over-fitting phenomenon: too many support vectors were used to explain the training set, which gives a good classification effect on the training set but a poor effect on predictions. The accuracy of norm-standardization PCA-SVM is clearly better on this label in the test set (24.5283%). For further analysis, we construct an equal-weighted portfolio of the stocks selected by PCA-SVM and compare the accumulated return (ACR) gained by this model with that of the A-share index of the Shanghai Stock Exchange. The comparison is presented in Fig. 1. It shows that PCA-SVM achieves a higher accumulated return than the A-share index, which suggests that the SVM classification method is accurate and efficient when dealing with complex and highly dimensional data.

Fig. 1. Comparison between PCA-SVM and A-share index of Shanghai Stock Exchange

6. Conclusions

Support Vector Machine is commonly used to train on time-series data of stocks for price forecasting. In this paper, SVM is employed to generate an optimal separating hyper-plane in a high-dimensional space based on the training set. To increase the accuracy and efficiency of the SVM classification model, we apply PCA to process the original data. Finally, the empirical result suggests that the return of the stocks selected by PCA-SVM is clearly superior to that of the A-share index. The information features of companies' financial ratios vary with their industries. We believe that the quality of the training set could be improved by applying PCA to each industry separately. Additionally, it would be meaningful for achieving higher returns if stocks were given different weights according to their risk-return characteristics when portfolios are constructed.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant No. 71171176).

References

1. Ming Guo, Yuan-Bao Zhang. A Stock Selection Model Based on Analytic Hierarchy Process, Factor Analysis and TOPSIS // The International Conference on Computer and Communication Technologies in Agriculture Engineering. 2010. p. 466-469.
2. Kuo R.J., Chen C.H. & Hwang Y.C. An Intelligent Stock Trading Decision Support System Through Integration of Genetic Algorithm based Fuzzy Neural Network and Artificial Neural Network. Fuzzy Sets and Systems. 2001; 118: 21-45.
3. Tsumato S., Slowinski S., Komorowski J. & Grzymala-Busse J.W. Lecture Notes in Artificial Intelligence. The fourth international conference on rough sets and current trends in computing. 2004.
4. E.L. de Faria, Marcelo P. Albuquerque, J.L. Gonzalez, J.T.P. Cavalcante, Marcio P. Albuquerque. Predicting the Brazilian Stock Market Through Neural Networks and Adaptive Exponential Smoothing Methods. Expert Systems with Applications. 2009; 36: 12506-12509.
5. Yudong Zhang, Lenan Wu. Stock Market Prediction of S&P 500 via Combination of Improved BCO Approach and BP Neural Network. Expert Systems with Applications. 2009; 36: 8849-8854.
6. Vladimir N. Vapnik. Statistical Learning Theory. Publishing House of Electronics Industry. 2004.
7. Chi-Yuan Yeh, Chi-Wei Huang, Shie-Jue Lee. A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications. 2011; 38: 2177-2186.
8. Pengpeng Huang. Prediction of the Turnover Points in Stock Trend Based on Support Vector Machine. College of Software, Fudan University. 2010.