ScienceDirect. A SVM Stock Selection Model within PCA

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 31 (2014) 406-412

2nd International Conference on Information Technology and Quantitative Management, ITQM 2014

A SVM Stock Selection Model within PCA

Huanhuan Yu a, Rongda Chen b,*, Guoping Zhang c

a School of Finance, Zhejiang University of Finance & Economics, Hangzhou, 310018, China
b School of Finance, Zhejiang University of Finance & Economics, Hangzhou, 310018, China
c School of Economics and International Trade, Zhejiang University of Finance & Economics, Hangzhou, 310018, China

Abstract

In the financial market, well-performing stocks usually show specific features in their financial figures. This paper uses a machine learning method, the support vector machine (SVM), to construct a stock selection model that performs nonlinear classification of stocks. However, the accuracy of SVM classification is very sensitive to the quality of the training set. To avoid the direct use of complicated and highly dimensional financial ratios, we bring principal component analysis (PCA) into the SVM model to extract low-dimensional and efficient feature information, which improves training accuracy and efficiency while preserving the features of the initial data. The empirical results show that the SVM-based stock selection model, with PCA applied after norm-standardization, achieves an overall accuracy of 75.4464% on the training set and of 61.7925% on the test set. Further, the annual earnings of the stock portfolio selected by the PCA-SVM model significantly outperform those of the A-share index of the Shanghai Stock Exchange.

© 2014 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Organizing Committee of ITQM 2014.

Keywords: machine learning; stock selection; principal components analysis; support vector machine

* Corresponding author. Tel.: +860571-85750010; fax: +860571-85212001. E-mail address: rongdachen@163.com.
1877-0509 © 2014 Published by Elsevier B.V. doi: 10.1016/j.procs.2014.05.284
Huanhuan Yu et al. / Procedia Computer Science 31 (2014) 406-412

1. Introduction

Stock has always been one of the most popular investment instruments in financial markets. Investors and researchers have devoted themselves to finding a method that accurately selects the stocks with favorable future returns as constituents of an investment portfolio. Guo and Zhang [1], Kuo et al. [2] and Tsumoto et al. [3] developed several methods to forecast stock prices or to pick qualified stocks from a large sample. However, traditional stock selection models usually face challenges when dealing with highly dimensional and nonlinear sample data, because stock selection is a decision with multiple objectives and multiple restrictions, made on highly dimensional and voluminous financial data. The Artificial Neural Network (ANN), a machine learning-based method, can capture the regular patterns hidden behind complex, high-dimensional data [4,5]. Although ANN performs better than traditional methods, it has many defects, such as the difficulty of determining network structures, the problem of local minimum points and over-fitting. Vapnik [6] proposed a new machine learning-based method called Support Vector Machine (SVM), which handles high-dimensional data better and avoids these defects of ANN. SVM is widely applied in many fields because of its particular advantages. Many studies, domestic and abroad, use SVM to predict stock prices or reversal points, such as Yeh et al. [7] and Huang [8]. However, stock selection models based on SVM are seldom established, and are especially rare domestically.

This paper applies SVM to the domestic stock market to establish an effective selection model. We treat the financial ratios of listed companies in the A-share market of the Shanghai Stock Exchange as original data, and then use principal components analysis (PCA) to preprocess them. First, we establish a stock selection model (PCA-SVM) that recognizes high-return stocks by using SVM theory to train on the training set. Second, we apply PCA-SVM to the test set to forecast the high-return stocks in the next year, and compare the forecast with the actual outcome to illustrate the effectiveness of the established stock selection model.

2. Principal components analysis (PCA)

Financial ratios of a listed company cover earning ability, growth ability, solvency ability and so on. Each ability contains many sub-ratios.
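If all the ratios were used as inputs in the training set, the result would be redundancy and low efficiency, and possibly a decrease in the quality of the empirical results. New variables can be created through a transformation of the original variables, so that the number of variables is smaller while most of the information is retained. These new variables are called principal components.

2.1. Definition of principal components

Principal components can be expressed as follows:

\[
\begin{aligned}
Y_1 &= \alpha_1' X = \alpha_{11} X_1 + \alpha_{12} X_2 + \cdots + \alpha_{1n} X_n \\
Y_2 &= \alpha_2' X = \alpha_{21} X_1 + \alpha_{22} X_2 + \cdots + \alpha_{2n} X_n \\
&\;\;\vdots \\
Y_n &= \alpha_n' X = \alpha_{n1} X_1 + \alpha_{n2} X_2 + \cdots + \alpha_{nn} X_n
\end{aligned}
\tag{1}
\]

where $X = (X_1, X_2, \ldots, X_n)'$ is the vector of original variables, $Y_i$ is the $i$th principal component and $\alpha_i$ is its coefficient vector. $\alpha_i$ can be estimated by maximizing $\mathrm{Var}(Y_i)$ under the constraints $\alpha_i' \alpha_i = 1$ and $\mathrm{Cov}(Y_i, Y_j) = \alpha_i' \Sigma \alpha_j = 0$, $j = 1, 2, \ldots, i-1$, where $\Sigma = (\sigma_{ij})$ is the covariance matrix of $X$.

2.2. Selection of principal components

The covariance matrix of $X = (X_1, X_2, \ldots, X_n)'$, $\Sigma = (\sigma_{ij})$, is a symmetric non-negative definite matrix. Therefore it has $n$ characteristic roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ and $n$ characteristic vectors. Suppose $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$ and the orthogonal unit eigenvectors are $e_1, e_2, \ldots, e_n$. The $i$th principal component of $X_1, X_2, \ldots, X_n$ can be expressed as follows:

\[
Y_i = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{in} X_n, \quad i = 1, 2, \ldots, n
\tag{2}
\]

with $\mathrm{Var}(Y_i) = e_i' \Sigma e_i = \lambda_i$ and $\mathrm{Cov}(Y_i, Y_j) = e_i' \Sigma e_j = 0$, $i \ne j$. The accumulated contribution rate (ACR) of the first $p$ principal components is

\[
\mathrm{ACR}(p) = \sum_{i=1}^{p} \lambda_i \Big/ \sum_{i=1}^{n} \lambda_i
\tag{3}
\]

which represents the explanatory power, with respect to the original data, of the principal components extracted by PCA. Generally, an ACR of at least 85% is required; otherwise PCA is considered unsuitable because it loses too much of the original information.

Since the covariance matrix is sensitive to the order of magnitude of the data, we need to standardize the data first. Two methods of standardization are in common use:

Norm-standardization: $X_{ij}^{*} = (X_{ij} - \bar{X}_j) / s_j$, where $\bar{X}_j$ is the mean and $s_j$ is the standard deviation.

Mean-standardization: $X_{ij}^{*} = X_{ij} / \bar{X}_j$, where $\bar{X}_j$ is the mean.

3. Support vector machine

3.1. Linear classification of SVM

Linear classification of SVM is realized by solving for the optimal separating hyper-plane when the training set is linearly separable. If the two mingled classes $(C_1, C_2)$ of a sample can be separated correctly by a linear function $(H_0)$ in a two-dimensional plane, the sample is treated as linearly separable. Suppose the training set is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the sample information vector (here, the coordinate vector in a two-dimensional plane), $y_i \in Y = \{1, -1\}$, $+1$ represents class $C_1$ and $-1$ represents class $C_2$. The linear separating hyper-plane $H_0: w \cdot x + b = 0$ separates the training set correctly if and only if $w \cdot x_i + b \ge 1$ when $y_i = 1$ and $w \cdot x_i + b \le -1$ when $y_i = -1$. If the distance $D$ between the two data clusters of the sample is maximized, this hyper-plane is called the optimal separating hyper-plane of the classification. Define

\[
D = d_{+} + d_{-}, \qquad d_{\pm} = \min_{i:\, y_i = \pm 1} \frac{|w \cdot x_i + b|}{\|w\|}
\tag{4}
\]

By substituting $w \cdot x_i + b = \pm 1$ into (4), we obtain $D = d_{+} + d_{-} = 2 / \|w\|$, and the problem is transformed into finding the $w$ that minimizes $\|w\|$ ($b$ can be calculated by substituting sample points once $w$ is known). Additionally, to avoid the situation in which the distance between the two parallel hyper-planes is maximized while effective classification is not realized, we must impose constraints on this optimization problem as follows:

\[
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, 2, \ldots, n
\tag{5}
\]

$\xi_i$ is the slack variable that tolerates the outliers.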
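As a small numerical illustration of (4) and (5), the sketch below evaluates the margin $D = 2/\|w\|$ and the smallest slack $\xi_i$ that satisfies (5) for each point. The four 2-D points, the hand-picked hyper-plane and the function names `margin_width` and `slack` are hypothetical, chosen only for illustration; they are not taken from the paper's data.

```python
import math


def margin_width(w):
    """Distance D = 2 / ||w|| between the hyper-planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def slack(w, b, x, y):
    """Smallest xi >= 0 such that y * (w.x + b) >= 1 - xi, as in constraint (5)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


# Hypothetical toy data: (point, label) pairs and a hand-picked hyper-plane x1 + x2 - 3 = 0
w, b = (1.0, 1.0), -3.0
samples = [((3.0, 1.0), 1), ((1.0, 1.0), -1), ((2.0, 1.5), 1), ((2.5, 1.0), -1)]

D = margin_width(w)
slacks = [slack(w, b, x, y) for x, y in samples]
print(D, slacks)
```

The first two points lie exactly on the margin hyper-planes and need no slack, the third is correctly classified but falls inside the margin ($\xi = 0.5$), and the fourth is misclassified ($\xi = 1.5$).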
A penalty factor $C$ is also introduced into the objective function to reflect the losses incurred by tolerating the outliers. Training an SVM model, i.e. solving the optimization problem, leads to a quadratic programming problem, as shown in (6):

\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
\tag{6}
\]

Suppose $\alpha^{*}$ is the solution of (6); then the optimal hyper-plane is $w^{*} \cdot x + b^{*} = 0$, where $w^{*} = \sum_{i=1}^{n} \alpha_i^{*} y_i x_i$ and $b^{*}$ can be calculated from the constraints of (5).

3.2. Nonlinear classification of SVM

The linear classification of SVM discussed in the prior section can only be applied when the sample is linearly separable. In this section, a nonlinear SVM method is used to handle the complicated and high-dimensional financial ratios. A kernel function is essential here: it corresponds to a mapping of the original data into a high-dimensional space $H$, i.e. $\Phi: R^n \to H$; $x \mapsto \Phi(x)$, such that the data become linearly separable in $H$. The optimal separating hyper-plane discussed in the prior section can then be obtained in $H$ to do the classification. Suppose the training set is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ is the highly dimensional information vector of the sample and $y_i \in Y = \{1, -1\}$. A quadratic programming problem similar to (6) is obtained through the mapping $\Phi$:

\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, n \\
& \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
\tag{7}
\]

Solving (7) directly would require knowing $\Phi: R^n \to H$; $x \mapsto \Phi(x)$, so we choose the Gauss radial basis kernel function (RBF) to obtain the inner product value $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ directly, without searching for the complex $\Phi$.

4. Data selection

This paper selects 7 categories of financial ratios of companies in the A-share market of the Shanghai Stock Exchange from their annual reports of 2009 and 2010. The detailed financial indexes chosen are shown in Table 1.

Table 1. Financial ratios and sample stocks information

Sample stocks: 2009, 677 stocks; 2010, 679 stocks

Category                 Financial ratios
A  Earnings ability      EBIT (a1); ROA (a2); ROE (a3)
B  Activity ratios       Turnover of accounts receivable (b1); Turnover of inventory (b2); Turnover of current assets (b3)
C  Shareholder return    EPS (c1); Price-to-book ratio (c2); Common stock profitability (c3); P/CF (c4)
D  Cash ratios           EBIT-to-Cash ratio (d1); Cash-to-Assets ratio (d2); Operating ratio (d3)
E  Growth ratios         Growth of total assets (e)
F  Risk level            Financial leverage (f1); Operating leverage (f2)
G  Solvency ratios       Quick ratio (g1); Debt-to-Asset ratio (g2); EBIT/Interest ratio (g3); EBIT/Fixed charge ratio (g4)

Our objective is to separate the high-return stocks from the low-return ones according to the features hidden inside their financial ratios, so it is necessary to label each stock with its return characteristic. Statistical analysis shows that all the companies had announced their annual reports before 1 May in 2009 and 2010. Therefore we label a stock as +1 if its return ranks in the top 25% of all the sample stocks, i.e. $y_i = 1$, and set $y_i = -1$ for the remaining stocks. Labels of a part of the sample are presented in Table 2.

5. Stock selection model and analysis

5.1. Extraction of training set based on PCA method

The financial ratios of 677 stocks in 2009 are the original data. We apply PCA to extract the principal components satisfying the condition ACR ≥ 85%. Since our sample is large, applying PCA to all of the ratios of the 677 stocks directly would lose local information and would also weaken the effect of dimension reduction. Thus we perform one PCA extraction for every 40 sample stocks. The training set is shown in Table 2.

Table 2. Training set of SVM nonlinear classification (part of 677 stocks)

Stock code  Earnings ability  Activity ratios  Shareholder return  Cash ratios  Growth ratios  Risk levels  Solvency ratios   y

PCA with norm-standardization
600069          -1.6114          -0.9830           -0.4337           -1.0664       -0.4253        0.7874         0.1431        1
600070           0.5249          -0.3005           -0.8563           -0.5438       -0.0903       -0.1103         0.0136       -1
600071           2.1843           0.1875           -1.5191            1.1364       -0.6570       -1.7170         0.7624        1

PCA with mean-standardization
600069           0.8222          -1.3006            0.8049            1.0620       -0.9571        0.3681         1.8768        1
600070           4.6133           1.0647           -0.3712           -1.1497        0.8309        1.6046         1.5020       -1
600071           7.0948           1.1286           -0.7982            0.2286        0.2485       -0.2133         2.0515        1

5.2. SVM stock selection model and analysis

The total scores obtained in the prior section, combined with the return labels of the sample stocks, constitute the complete training set of SVM. By applying the nonlinear classification of SVM introduced in Section 3 to the training set, we obtain the optimal separating hyper-plane. If we use this hyper-plane on the test set, the stocks in the test set are classified into a high-return part and a low-return part. This can be seen as a prediction of the stocks' future return characteristic.
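The construction of the training set described in Sections 2, 4 and 5.1 (norm-standardization, PCA with ACR ≥ 85%, and +1 labels for the top 25% of returns) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab procedure: it uses NumPy, synthetic random numbers in place of real financial ratios and returns, hypothetical function names (`norm_standardize`, `pca_scores`, `label_top_quartile`), and it does not reproduce the per-category scores of Table 2 or the batching into groups of 40 stocks beyond using 40 synthetic rows.

```python
import numpy as np


def norm_standardize(X):
    """Norm-standardization: subtract each column mean and divide by the column std.

    Assumes no column is constant (a zero std would divide by zero).
    """
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def pca_scores(X, acr_threshold=0.85):
    """Scores on the fewest principal components whose ACR reaches the threshold."""
    Z = norm_standardize(X)
    cov = np.cov(Z, rowvar=False)               # covariance matrix of standardized data
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    acr = np.cumsum(eigvals) / eigvals.sum()    # ACR(p) for p = 1..n, as in (3)
    p = min(int(np.searchsorted(acr, acr_threshold)) + 1, len(acr))
    return Z @ eigvecs[:, :p], acr[:p]


def label_top_quartile(returns):
    """+1 for returns above the 75th percentile (top 25%), -1 for the rest."""
    cutoff = np.quantile(returns, 0.75)
    return np.where(returns > cutoff, 1, -1)


# Synthetic stand-in data: 40 "stocks", 7 correlated "ratios", random "returns"
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 3))
ratios = np.hstack([base, base @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(40, 4))])
returns = rng.normal(size=40)

scores, acr = pca_scores(ratios)
y = label_top_quartile(returns)
print(scores.shape, acr, int((y == 1).sum()))
```

The rows of `scores` together with `y` play the role of the SVM training set; the SVM training step itself is left out of this sketch.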
The accuracy of classification and prediction is presented in Table 3.

Table 3. Accuracy of SVM nonlinear classification

Set        Measure           Mean-standardization PCA-SVM   Norm-standardization PCA-SVM
Training   Whole accuracy    88.6905%                       75.4464%
Training   Accuracy of +1    100%                           58.5366%
Training   Accuracy of -1    85.0394%                       80.9055%
Test       Whole accuracy    69.1943%                       61.7925%
Test       Accuracy of +1    10.1266%                       24.5283%
Test       Accuracy of -1    88.8421%                       74.2138%

Training and testing of SVM are carried out with Libsvm 3.1 in Matlab. To achieve the best generalization ability, the optimal penalty factor C and the RBF kernel coefficient are determined by the grid search method.
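The three measures reported in Table 3 can be computed from true and predicted labels as in the sketch below. The function name `class_accuracies` and the eight toy labels are hypothetical and serve only to show the calculation; they do not reproduce the paper's results.

```python
def class_accuracies(y_true, y_pred):
    """Return (whole accuracy, accuracy of +1, accuracy of -1) as fractions.

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))

    def acc(subset):
        return sum(t == p for t, p in subset) / len(subset)

    whole = acc(pairs)
    pos = acc([(t, p) for t, p in pairs if t == 1])    # stocks truly labeled +1
    neg = acc([(t, p) for t, p in pairs if t == -1])   # stocks truly labeled -1
    return whole, pos, neg


# Hypothetical toy labels: 2 high-return stocks (+1) and 6 others (-1)
y_true = [1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, -1, -1, -1, -1, -1, -1, 1]
whole, pos, neg = class_accuracies(y_true, y_pred)
print(whole, pos, neg)
```

Reporting the per-class figures matters because only about a quarter of the stocks carry the +1 label: a classifier can reach a high whole accuracy while recognizing few high-return stocks, as the mean-standardization test column of Table 3 shows (69.1943% overall, 10.1266% on +1).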

Table 3 shows that the accuracy of mean-standardization PCA-SVM for label +1 in the training set is 100%, whereas its accuracy for the same label in the test set is only 10.1266%. This is the over-fitting phenomenon: too many support vectors are used to explain the training set, which yields a good classification effect on the training set but a poor effect on predictions. The norm-standardization PCA-SVM is clearly better in this respect, with a test accuracy of 24.5283% for label +1.

For further analysis, we construct an equally weighted portfolio of the stocks selected by PCA-SVM and compare the accumulated return (ACR) gained by this model with that of the A-share index of the Shanghai Stock Exchange. The comparison is presented in Fig. 1. It shows that PCA-SVM achieves a higher accumulated return than the A-share index, which indicates that the SVM classification method is accurate and efficient when dealing with complex and highly dimensional data.

Fig. 1. Comparison between PCA-SVM and A-share index of Shanghai Stock Exchange

6. Conclusions

Support Vector Machine is commonly trained on time-series data of stocks for price forecasting. In this paper, SVM is employed to generate an optimal separating hyper-plane in a high-dimensional space based on the training set. To increase the accuracy and efficiency of the SVM classification model, we apply PCA to process the original data. Finally, the empirical result suggests that the return of the stocks selected by PCA-SVM is clearly superior to that of the A-share index.

The information features of companies' financial ratios vary with their industries. We believe that the quality of the training set can be improved if PCA is applied to each industry separately. Additionally, higher returns might be achieved if stocks were given different weights according to their risk-return characteristics when portfolios are constructed.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant No. 71171176).

References

1. Ming Guo, Yuan-Biao Zhang. A Stock Selection Model Based on Analytic Hierarchy Process, Factor Analysis and TOPSIS // The International Conference on Computer and Communication Technologies in Agriculture Engineering. 2010. p. 466-469.
2. Kuo R.J., Chen C.H. & Hwang Y.C. An Intelligent Stock Trading Decision Support System Through Integration of Genetic Algorithm based Fuzzy Neural Network and Artificial Neural Network. Fuzzy Sets and Systems. 2001; 118: 21-45.
3. Tsumoto S., Slowinski S., Komorowski J. & Grzymala-Busse J.W. Lecture Notes in Artificial Intelligence. The fourth international conference on rough sets and current trends in computing. 2004.
4. E.L. de Faria, Marcelo P. Albuquerque, J.L. Gonzalez, J.T.P. Cavalcante, Marcio P. Albuquerque. Predicting the Brazilian Stock Market Through Neural Networks and Adaptive Exponential Smoothing Methods. Expert Systems with Applications. 2009; 36: 12506-12509.
5. Yudong Zhang, Lenan Wu. Stock Market Prediction of S&P 500 via Combination of Improved BCO Approach and BP Neural Network. Expert Systems with Applications. 2009; 36: 8849-8854.
6. Vladimir N. Vapnik. Statistical Learning Theory. Publishing House of Electronics Industry. 2004.
7. Chi-Yuan Yeh, Chi-Wei Huang, Shie-Jue Lee. A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications. 2011; 38: 2177-2186.
8. Pengpeng Huang. Prediction of the Turnover Points in Stock Trend Based on Support Vector Machine. College of Software, Fudan University. 2010.