12 slots, 2 hours each. A homework: visualization, simple testing, and simple classification algorithms.


Clinton Garrett
5 years ago


3 Approximate Syllabus: Organization and structure. Intro to R. Set operations. Venn diagrams. De Morgan's laws. Probability. Tutorial in R. Descriptive statistics. Plots in R. Conditional probability and Bayes' theorem. Random variables and their distributions. Expectations, moments and transformations. Markov's inequality. Chebyshev's inequality. Some univariate discrete and continuous distributions. Sampling distributions and the main large-sample theorems. Normal distribution. Central Limit Theorem. t-distribution, F-distribution. Testing the significance. p<0.05. One-sample Z-test. One-sided and two-sided tests. The p-value. Testing μ with unknown σ. The t-test. Testing the variance. Type I and II errors. The power of a test. Hypothesis testing for two and more samples. ANOVA testing. Other tests. Correlation and association analysis. Chi-square test. Entropy. Mutual information. Linear correlation. Intraclass correlation.

4 Approximate Syllabus (ctd): Modelling of data. Linear regression. Maximum likelihood estimation. Model diagnostics. Logistic regression and odds ratio. Stepwise regression and finding the best model. Ridge and lasso algorithms. Classification. LDA. Nearest centroid. kNN. Artificial Neural Network. SVM. Dimension reduction. Cross-validation. Assessing the performance of a classifier. Accuracy. Sensitivity. Specificity. Matthews correlation coefficient. Introduction to the Perceptron. Multilayer NN. PCA vs LDA. Unsupervised learning. K-means algorithm. Hierarchical clustering. Nearest-Neighbour algorithm. Parenclitic Network Analysis. Integrated Information. Genetic intelligence. Alexey Zaikin/Oleg Blyuss. Approximate Bayesian Computation. Importance sampling. MCMC - Markov Chain Monte Carlo. Case study: serial oncomarkers. Thomas Bartlett: Sparse Statistical Modelling

5 , SAS, Stata

8 Install RStudio!

9 1.6 First look at probability
A first look at Probability vs Statistics:
Probability: deals with formalizing the mechanism that generated the data. Given a model written in terms of probability, we can study its mathematical properties and understand or predict which events are likely to happen in the future or under different scenarios.
Statistics: involves the analysis of the frequency of past events. Historical data can be used to test whether a probability model is suitable or not. If it is, the probability model will help us to understand the situation and guide us in making decisions.
[Diagram labels: Probability, ideal world of the model, real world, Statistics]
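The two directions can be made concrete with a coin-tossing sketch. The fair-coin model, the 10 tosses and the observed count of 8 heads are illustrative assumptions, not part of the slides.

```r
# Probability: start from a model and ask what it predicts.
# Assumed model: 10 tosses of a fair coin, X ~ Binomial(10, 0.5).
n <- 10
p_model <- 0.5
p_at_least_8 <- sum(dbinom(8:10, size = n, prob = p_model))
print(p_at_least_8)   # P(X >= 8) under the model

# Statistics: start from data and ask whether the model is suitable.
# Hypothetical observation: 8 heads in 10 tosses.
heads <- 8
fit <- binom.test(heads, n, p = p_model)
print(fit$estimate)   # probability of heads estimated from the data
print(fit$p.value)    # two-sided p-value for the fair-coin model
```

The first half never looks at data, and the second half uses the data to judge the model.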


10 1.8 Stochasticity
Describing the real world as a probabilistic world is actually natural:
Thermodynamics: to describe the impact of a huge number of molecules.
Quantum dynamics: the Heisenberg uncertainty principle.
Is stochasticity a fundamental property of our world?
Deterministic chaos: the Lorenz attractor [figure].

11 1.7 Sample space
Any process of observation is referred to as an experiment. The results of an experiment are its outcomes.
Probability is a way of expressing knowledge or belief that an event will occur or has occurred. To define probability we need a set S consisting of all possible outcomes of the experiment.
The sample space S is the set of all possible outcomes of a random experiment.
An element s of S is a sample point.
A sample space S is said to be
  discrete if it consists of a finite number of sample points,
  countable if its elements can be placed in a one-to-one correspondence with the positive integers,
  continuous if the sample points constitute a continuum.
The set containing no elements is called the null or empty set and is denoted by ∅. It is the unique set that contains no elements.
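For a discrete sample space these definitions map directly onto R's set functions. The sketch below assumes a single roll of a six-sided die; the events A and B are made up for illustration.

```r
# Sample space of one roll of a six-sided die (illustrative assumption)
S <- 1:6
A <- c(2, 4, 6)       # event: the outcome is even
B <- c(4, 5, 6)       # event: the outcome is greater than 3

A_or_B  <- union(A, B)          # A or B occurs
A_and_B <- intersect(A, B)      # A and B both occur
not_A   <- setdiff(S, A)        # complement of A within S
empty   <- intersect(A, not_A)  # an event and its complement share no sample points

# De Morgan's law: the complement of a union is the intersection of the complements
lhs <- setdiff(S, union(A, B))
rhs <- intersect(setdiff(S, A), setdiff(S, B))
print(setequal(lhs, rhs))
```

Here `empty` has length zero, which is how the empty set appears in R.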


16 1.14 Diagrams

17 1.15 More on diagrams

18 1.16 Partitions

19 1.19 Summary

20 1.21 Interpretations

26 1.27 Lifetime of cells

27 Tutorial In R

28 1.28 Data Structure in R

29 1.29 Data Structure in R

30 1.30 Data Structure in R

31 1.31 Data Structure in R

32 1.32 Data Structure in R

33 1.33 Data Structure in R

34 1.33 Data Structure in R

35 1.34 Data Structure in R

36 1.35 Data Structure in R

37 1.36 Data Structure in R
> attributes(d)
$names
[1] "x" "y"

$row.names
[1]

$class
[1] "data.frame"
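Output of this kind can be reproduced with any small data frame. The contents of d below are an assumption (the slide does not show how d was built); only the column names x and y are taken from the output above.

```r
# A hypothetical data frame with columns x and y
d <- data.frame(x = c(1.5, 2.0, 3.2), y = c("a", "b", "c"))

attributes(d)   # lists $names, $class and $row.names
names(d)        # column names
class(d)        # "data.frame"
dim(d)          # number of rows and columns
```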

38 1.37 Operations with matrix elements

39 1.38 Getting data in R
There are several ways to get data into R:
1. Read it from a file (a text file or an Excel file)

40 1.39 Getting data in R
2. Generate data inside R

41 1.40 Getting data in R
3. Download data from the R databank
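The routes listed on these slides can be sketched in a few lines of base R. The temporary file and its contents are invented for the example, and the last step assumes that "R databank" refers to the data sets shipped with R. Reading Excel files needs an add-on package and is not shown.

```r
# Read data from a text file (a small tab-separated file is written first
# so that the example is self-contained)
path <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "1\t0.5", "2\t1.5", "3\t2.5"), path)
from_file <- read.table(path, header = TRUE, sep = "\t")

# Generate data inside R
set.seed(1)                                 # make the random numbers reproducible
generated <- rnorm(100, mean = 1, sd = 1)   # 100 normal random numbers
grid <- seq(0, 10, by = 0.1)                # a regular grid of values

# Load a data set that ships with R (package 'datasets')
data(iris)

str(from_file)
summary(generated)
head(iris)
```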

42 Descriptive statistics

43 1.43 Phases of a statistical analysis

44 1.44 Random sample and parametric modelling

45 1.45 Phases of a data analysis


47 1.47 Initial stage


49 1.49 Measures of central tendency

50 1.50 Quantiles and range

51 1.51 Skewness


53 1.53 Mean, mode and median in skewed sample distributions
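A small right-skewed sample shows the typical ordering mode < median < mean. The data vector is made up for illustration, and the skewness is computed with the simple moment formula rather than a package function.

```r
# A made-up right-skewed sample
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

m   <- mean(x)
med <- median(x)

# Base R has no function for the statistical mode; take the most frequent value
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])

# Moment coefficient of skewness: third central moment over the cubed
# (population) standard deviation
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5

quantile(x, c(0.25, 0.5, 0.75))   # quartiles
range(x)                          # smallest and largest value
c(mode = mode_x, median = med, mean = m, skewness = skew)
```

The long right tail pulls the mean above the median, and the skewness comes out positive.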

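54 1.54 Writing the plot to a file
x11(width=5,height=4)            # open a screen device (needs a display)
par(mar=c(1,1,1,1)*5)
layout(matrix(1:1,1,1))
x=seq(0,10,by=0.1)
plot(x,sin(x),type="l",col="red")
dev.copy2eps(file="sin.ps")      # copy the current plot to an EPS file
# or
dev.copy2pdf(file="sin.pdf")     # copy the current plot to a PDF file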
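On a machine without a display, x11() is not available. One alternative, sketched here, is to open a file device directly, draw, and close it. The temporary output path is an assumption of the example.

```r
# Write the plot straight to a PDF file, without opening a screen device
out <- file.path(tempdir(), "sin.pdf")

pdf(out, width = 5, height = 4)   # open the file device (size in inches)
par(mar = c(1, 1, 1, 1) * 5)
x <- seq(0, 10, by = 0.1)
plot(x, sin(x), type = "l", col = "red")
dev.off()                         # close the device; this finishes the file

file.exists(out)
```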

55 1.55 Histograms
library(UsingR)
simple.hist.and.boxplot(rnorm(100,mean=1,sd=1))
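If the add-on package is not installed, a similar picture can be put together in base R. This is a rough equivalent, not the package's implementation; the seed and the output file are assumptions of the example.

```r
set.seed(42)                         # reproducible sample
x <- rnorm(100, mean = 1, sd = 1)

pdf(file.path(tempdir(), "hist_boxplot.pdf"), width = 5, height = 8)
par(mfrow = c(2, 1))                 # histogram on top, boxplot below
h <- hist(x, main = "Histogram and boxplot", xlab = "x")
boxplot(x, horizontal = TRUE)
dev.off()

sum(h$counts)   # every observation falls into exactly one bin
```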

56 1.59 Boxplots with notches
boxplot(case[,7],case[,8],col=c("red","blue"),notch=TRUE)
Boxplots with notches are plotted using the following numbers: the 0.25, 0.5 and 0.75 quantiles for the box bottom, horizontal line and box top; the sample extremes for the whiskers (by default, points further than 1.5 IQR from the box are drawn separately as outliers); and an approximate 95% confidence interval of the median for the notches. The confidence interval for the median is calculated as median +/- 1.58 IQR / sqrt(n), where IQR is the interquartile range and n is the sample size.
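The notch limits can be checked by hand against boxplot.stats(), the function R's boxplot() relies on. The sample is simulated for illustration. Note that R takes the box edges from fivenum() (the hinges), which can differ slightly from quantile().

```r
set.seed(7)
x <- rnorm(50, mean = 1, sd = 1)   # simulated sample (illustrative)

five <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
iqr  <- five[4] - five[2]          # interquartile range based on the hinges
n    <- length(x)

notch <- five[3] + c(-1.58, 1.58) * iqr / sqrt(n)   # median +/- 1.58 IQR / sqrt(n)

print(notch)
print(boxplot.stats(x)$conf)       # the notch limits computed by R
```

The division by sqrt(n) means the notches narrow as the sample grows.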

57 1.56 Sophisticated scatter plots
case<-read.csv("case_wo_outlier.txt", header=TRUE, sep="\t")

##############################################
# Scatterplot matrices from the gclus package
##############################################
postscript('casescatterplots2.ps')
library(gclus)
dta <- case[,2:7]
dta.r <- abs(cor(dta))        # get correlations
dta.col <- dmat.color(dta.r)  # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5, main="case")
dev.off()
cor(case[,2:7])

##############################################
# Scatterplot matrices from the car package
##############################################
library(car)
postscript('casescatterplots.ps')
# note: recent versions of car name this function scatterplotMatrix()
scatterplot.matrix(case[,2:7], data=NULL, diagonal=c("histogram"), main="case study")
dev.off()   # close the device after the plotting call, not inside it


60 Try by yourself:
x=rnorm(200,mean=1,sd=1)
x1=rnorm(200,mean=2,sd=1)
x4=rnorm(200,mean=4,sd=1)
x2=rnorm(200,mean=2,sd=1)
x3=rnorm(200,mean=1.5,sd=1)
library(gclus)
dta=cbind(x,x1,x2,x3,x4)
dta=cbind(x,x1,x2,x3)         # overwrites the line above; keep whichever set of columns you want
dta.r <- abs(cor(dta))        # get correlations
dta.col <- dmat.color(dta.r)  # get colors
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5, main="case")
library(car)
scatterplot.matrix(dta, data=NULL, diagonal=c("histogram"), main="case study")

61 1.60 Multidimensional visualization
x=rnorm(300,mean=1,sd=3)
y=rnorm(300,mean=1,sd=3)
z=rnorm(300,mean=1,sd=3)
t=rnorm(300,mean=1,sd=3)
x1=rnorm(300,mean=10,sd=3)
x2=rnorm(300,mean=10,sd=3)
x3=rnorm(300,mean=1,sd=3)
x4=rnorm(300,mean=1,sd=3)
library(rggobi)
data=rbind(cbind(x,y,z,t),cbind(x1,x2,x3,x4))
g = ggobi(data)
glyph_colour(g[1])<-c(rep(3,300),rep(4,300))   # colour the two groups of 300 points differently
This allows visualization in a multidimensional (>3D) space!

62 1.61 Plotting densities
Suppose we have two data sets, data0x and data12x. Plot the densities of column 1:
maxi=max(c(data0x[,1],data12x[,1]),na.rm=TRUE)
mini=min(c(data0x[,1],data12x[,1]),na.rm=TRUE)
d0=density(data0x[,1],adjust=0.7,na.rm=TRUE)
d12=density(data12x[,1],adjust=0.7,na.rm=TRUE)
ylim=max(c(d0$y,d12$y))   # ylim is not defined on this slide; here it is taken from the two density curves
plot(d0, xlab=names(data0x)[1],ylab="probability", main="", col="blue",type="l",xlim=c(mini,maxi),ylim=c(0,ylim),lwd=2)
points(d12, col="red",type="l",lwd=2)
# vertical lines at the two sample means
points(c(mean(data0x[,1],na.rm=TRUE),mean(data0x[,1],na.rm=TRUE)),c(0,1.0),type="l",col="blue",lwd=0.6)
points(c(mean(data12x[,1],na.rm=TRUE),mean(data12x[,1],na.rm=TRUE)),c(0,1.0),type="l",col="red",lwd=0.6)

63 1.62 Array of plots
layout(matrix(c(1:6),3,2))
for(i in 1:6){
  maxi=max(c(data0x[,i],data12x[,i]),na.rm=TRUE)
  mini=min(c(data0x[,i],data12x[,i]),na.rm=TRUE)
  v=c(0:20)*(maxi-mini)/20.0+mini   # 21 equally spaced histogram breaks
  h1=hist(data0x[,i],plot=FALSE,breaks=v)
  h2=hist(data12x[,i],plot=FALSE,breaks=v)
  maxi2=max(c(h1$density,h2$density))
  # hand-tuned axis limits for columns 3 to 6 of this data set
  if(i==3) {
    ylim=1.4; maxi=10
  } else if(i==4) {
    ylim=2.5; maxi=4
  } else if(i==5) {
    ylim=1.2; maxi=4
  } else if(i==6) {
    ylim=1.; maxi=7
  } else {
    ylim=maxi2
  }
  plot(density(data0x[,i],adjust=0.7,na.rm=TRUE), xlab=names(data0x)[i],ylab="probability", main="", col="blue",type="l",xlim=c(mini,maxi),ylim=c(0,ylim),lwd=2)
  points(density(data12x[,i],adjust=0.7,na.rm=TRUE), col="red",type="l",lwd=2)
  points(c(mean(data0x[,i],na.rm=TRUE),mean(data0x[,i],na.rm=TRUE)),c(0,1.0),type="l",col="blue",lwd=0.6)
  points(c(mean(data12x[,i],na.rm=TRUE),mean(data12x[,i],na.rm=TRUE)),c(0,1.0),type="l",col="red",lwd=0.6)
}
dev.copy2eps(file=paste("densities_0_12",".ps",sep=""))

64 1.63 Plotting on several pages
layout(matrix(c(1:4),2,2))
a=0      # number of panels drawn on the current page
page=0
for(i in 1:5){
  for(j in (i+1):6){
    if(a>=4){   # the page is full: save it and start a new one
      a=0
      page=page+1
      dev.copy2eps(file=paste("scatterplots_",page,".ps",sep=""))
    }
    a=a+1
    max_x=max(c(data0xx[,i],data12x[,i]),na.rm=TRUE)
    min_x=min(c(data0xx[,i],data12x[,i]),na.rm=TRUE)
    max_y=max(c(data0xx[,j],data12x[,j]),na.rm=TRUE)
    min_y=min(c(data0xx[,j],data12x[,j]),na.rm=TRUE)
    plot(data0xx[,i],data0xx[,j],col="blue",main="scatter plots", xlab=names(data0xx)[i],ylab=names(data0xx)[j],xlim=c(min_x,max_x),ylim=c(min_y,max_y),cex=0.2)
    points(data12x[,i],data12x[,j],col="red",cex=0.2)
  }
}
page=page+1
dev.copy2eps(file=paste("scatterplots_",page,".ps",sep=""))   # save the last (possibly partial) page
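A multi-page PDF device takes care of the page bookkeeping automatically: once the 2 by 2 grid is full, the next plot starts a new page. The sketch below uses simulated stand-ins for the two data sets, since the original files are not available; the names d0 and d12 and the output path are hypothetical.

```r
set.seed(3)
# Simulated stand-ins for the two data sets (6 numeric columns each)
d0  <- as.data.frame(matrix(rnorm(600, mean = 0), ncol = 6))
d12 <- as.data.frame(matrix(rnorm(600, mean = 1), ncol = 6))

out <- file.path(tempdir(), "scatterplots.pdf")
pdf(out, width = 7, height = 7)    # one file, as many pages as needed
par(mfrow = c(2, 2))               # four panels per page
n_plots <- 0
for (i in 1:5) {
  for (j in (i + 1):6) {
    xr <- range(d0[, i], d12[, i], na.rm = TRUE)
    yr <- range(d0[, j], d12[, j], na.rm = TRUE)
    plot(d0[, i], d0[, j], col = "blue", cex = 0.2, xlim = xr, ylim = yr,
         xlab = names(d0)[i], ylab = names(d0)[j], main = "scatter plots")
    points(d12[, i], d12[, j], col = "red", cex = 0.2)
    n_plots <- n_plots + 1
  }
}
dev.off()

n_plots               # number of column pairs plotted
ceiling(n_plots / 4)  # number of pages in the PDF
```

The trade-off is a single multi-page file instead of one file per page.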
