CS229 Final Project
Shale Gas Production Decline Prediction Using Machine Learning Algorithms
Wentao Zhang wentaoz@stanford.edu Shaochuan Xu scxu@stanford.edu

In the petroleum industry, oil companies sometimes purchase producing oil and gas wells from others instead of drilling new ones. The shale gas production decline curve is critical when assessing how much more natural gas a specific well can produce in the future, which matters greatly during acquisitions between oil companies: even a small underestimate or overestimate of future production can significantly undervalue or overvalue an oilfield. In this project, we use Locally Weighted Linear Regression (LWLR) to predict future production from the existing decline curves. We then apply K-means to group the decline curves into two categories, high and low productivity. We also apply Principal Component Analysis (PCA) to compute the eigenvectors of the covariance matrix, based on which we predict future production both with K-means as a preprocessing step and without it. Finally, the three methods are compared with each other in terms of accuracy, measured by a standardized error.

• Dataset
The data we used are the monthly production rate curves of thousands of shale gas wells, shown below. To handle the different lengths of the curves and the missing production rate data points in some of them, we modify the data slightly: we substitute missing data points in any curve with a very small number (0.1), and we make all the curves the same length by padding zeros onto their tails, so that they can be loaded into MATLAB as a single matrix.
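This preprocessing can be sketched as follows (a minimal NumPy sketch; the original work used MATLAB, and the function name, the exact substitution value of 0.1, and the treatment of missing points as zeros are our assumptions):

```python
import numpy as np

def preprocess_curves(curves, eps=0.1):
    """Pack variable-length monthly production curves into one matrix.

    `curves` is a list of 1-D sequences of monthly production rates.
    Missing (zero) readings are replaced by a small value `eps`, and
    every curve is zero-padded on the right to the maximum length.
    """
    max_len = max(len(c) for c in curves)
    data = np.zeros((len(curves), max_len))
    for i, c in enumerate(curves):
        c = np.asarray(c, dtype=float)
        c[c == 0] = eps            # substitute missing points by a small number
        data[i, :len(c)] = c       # the remaining tail stays zero-padded
    return data
```

The zero padding only serves to make the matrix rectangular; the per-method filtering described below ignores padded months by removing curves that are too short.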
Figure 1: 2152 decline curves used to learn to predict production in the future.

• Locally Weighted Linear Regression (LWLR)
Our goal is to predict the future gas production of a new well, given its historical production data and information from other wells with longer histories. Suppose that we randomly choose a decline curve r with n months in total. We want to use the first l months to
predict the remaining (n − l) months of the curve. To find curves in the training set that are similar to r, we define the distance between two curves as the squared L2 norm of their difference. Before computing distances, we filter the training set by removing curves whose history is shorter than n. We then pick the k wells from the filtered training set that are closest to r over the first l months, give each of them a weight, and make the prediction for r as:

f_predicted = [ Σ_{i ∈ neighb_k} w( d(f_past_existing^(i), f_measured) / h ) · f_future_existing^(i) ] / [ Σ_{i ∈ neighb_k} w( d(f_past_existing^(i), f_measured) / h ) ]

where h is the longest distance among the k neighbors, f_measured is the known first l months of r, and f_past_existing^(i) and f_future_existing^(i) are the first l months and the remaining months of neighbor i.

Results:

Figure 2: Four typical predictions compared with the real data.

We restrict the number of neighbors k to 3. In Figure 2, four typical predictions are shown. The results are generally consistent with the real values. The predictions are smoother than the real curves, because each prediction is a weighted combination of multiple training wells.
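The LWLR prediction above can be sketched in NumPy as follows (a minimal sketch; the report does not specify the kernel w, so a Gaussian kernel is assumed here, and the function and variable names are ours):

```python
import numpy as np

def lwlr_predict(train, test_known, k=3):
    """Predict the future months of a test well from its k nearest
    training curves.

    `train` is a (wells, n) matrix of curves at least n months long;
    `test_known` holds the first l months of the test well.  Distances
    are squared L2 over the first l months, and weights come from an
    assumed Gaussian kernel scaled by the longest neighbor distance h.
    """
    l = len(test_known)
    d = np.sum((train[:, :l] - test_known) ** 2, axis=1)   # squared L2 distances
    idx = np.argsort(d)[:k]                                # k nearest curves
    h = d[idx].max() + 1e-12                               # longest distance among neighbors
    w = np.exp(-d[idx] / h)                                # kernel weights
    # weighted average of the neighbors' future months
    return (w[:, None] * train[idx, l:]).sum(axis=0) / w.sum()
```

With k = 1 this reduces to copying the tail of the single closest curve; with k = 3, as in the report, the output is a smooth blend of three neighbors.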
Figure 3: Predicted curves as l (the number of known months) increases, for a well that fits well; standardized error vs. month.

Figure 4: Predicted curves as l (the number of known months) increases, for a well that fits badly; standardized error vs. month.

In Figure 3 and Figure 4, we increase the known history l from short to long and plot the error versus the number of known months. The error does not decrease as we know more of the curve and predict less of it. The reason might be that the standardized error is defined as the average relative error over the predicted months: in the tail of the curve the absolute values are small, so the relative errors easily become large. A better error metric would need to be defined if we really want to tell whether the prediction improves with a longer known history.

• Principal Component Analysis (PCA)
Since each well has a history of up to tens of months, we intuitively want to reduce the time dimension while keeping the intrinsic components that reflect production decline. First, we filter the training set by removing the wells whose history is shorter than the total length n of the test curve. After normalizing the data, we eigen-decompose the empirical covariance matrix and extract the first 5 eigenvectors as the principal components. Then, we fit the known part of the test curve with a linear combination of the 5 eigenvectors. The coefficients θ of the linear combination are obtained by linear regression,

y_known = U_l θ

and we predict the future decline curve as

y_estimate = U θ
where y_known ∈ R^l is the normalized known history of the test well and U_l ∈ R^(l×5) contains the first l rows of the eigenvector matrix U. Our estimation is then y_estimate.

Results:

Figure 5: PCA predictions compared with the real data.

As can be seen from Figure 5, the prediction is either too smooth or too variable compared with the real data. This is because, at the fitting step, θ is either underfitted (high bias) or overfitted (high variance). Another problem with PCA is that all the training wells contribute to the estimation, which makes it imprecise when predicting very high or very low production.

• PCA after K-means
If we assume that high-productivity wells are similar to each other and low-productivity wells are similar to each other, we can group all the decline curves into two categories. We modify the K-means method to handle the fact that, in this real dataset, different decline curves have different dimensions: we calculate the distance between a centroid and a curve using only the dimensions of the shorter one. Comparing the two panels of Figure 6 shows that this modified K-means method is good enough to distinguish high-productivity wells from low-productivity wells.

Figure 6: Decline curves of the high-productivity wells (left) and the low-productivity wells (right).
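The modified K-means step can be sketched as follows (a minimal sketch; the report does not describe its initialization, so seeding the two centroids from the curves with the highest and lowest mean rate is our assumption, as are all names):

```python
import numpy as np

def truncated_dist(a, b):
    """Squared L2 distance over the shorter of the two lengths,
    as in the modified K-means described in the text."""
    m = min(len(a), len(b))
    return float(np.sum((np.asarray(a[:m], float) - np.asarray(b[:m], float)) ** 2))

def cluster_two(curves, n_iter=20):
    """Group variable-length decline curves into two clusters.

    Label 0 tracks the high-productivity seed, label 1 the low one.
    Centroids are refined by averaging each cluster only over the
    months its member curves actually have.
    """
    max_len = max(len(c) for c in curves)
    means = [float(np.mean(c)) for c in curves]
    seeds = [int(np.argmax(means)), int(np.argmin(means))]   # assumed seeding
    cents = [np.pad(np.asarray(curves[s], float), (0, max_len - len(curves[s])))
             for s in seeds]
    labels = [0] * len(curves)
    for _ in range(n_iter):
        labels = [int(truncated_dist(c, cents[1]) < truncated_dist(c, cents[0]))
                  for c in curves]
        for j in (0, 1):
            num, cnt = np.zeros(max_len), np.zeros(max_len)
            for c, lab in zip(curves, labels):
                if lab == j:
                    num[:len(c)] += c
                    cnt[:len(c)] += 1
            if cnt.any():                       # keep old centroid if cluster empties
                cents[j] = num / np.maximum(cnt, 1)
    return labels
```

Truncating the distance to the shorter length lets curves of any history length participate in the clustering without padding artifacts.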
Then, we run PCA again after clustering the original decline curves with K-means.

Figure 7: PCA predictions after K-means clustering.

From Figure 7, we can see that although the underfitting/overfitting problem still exists, the results are better than with the original PCA. This might be because clustering injects the L2-norm distance information into PCA, making it a more integrated method.

• Discussion

Figure 8: Errors of the three methods calculated by leave-one-out cross-validation.

We apply leave-one-out cross-validation to all three methods, compare the predictions with the real production data, and calculate the average relative errors as in Figure 3 and Figure 4. We also define a threshold value to cap extremely large errors. The reason is that a single extreme value can make the average of all relative errors huge, and these extreme values are caused by the shut-down periods of the wells (when production is nearly zero). Figure 8 confirms our intuition that LWLR is the best of the three methods, because no information is lost to dimensionality reduction. Plain PCA has the largest relative error of the three, because the higher-order principal components, which reflect the details of the decline curves, are not included. K-means helps cluster the wells into high- and low-productivity classes, which improves PCA by making that prior information available.
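The thresholded standardized error used in the comparison can be sketched as follows (a minimal sketch; the report does not state the threshold value, so the cap of 10 and the names here are our assumptions):

```python
import numpy as np

def standardized_error(pred, actual, cap=10.0):
    """Average relative error over the predicted months.

    Each per-month relative error is capped at `cap`, so that near-zero
    actual production (shut-down periods) cannot dominate the average.
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rel = np.abs(pred - actual) / np.maximum(np.abs(actual), 1e-12)
    return float(np.mean(np.minimum(rel, cap)))
```

Inside the leave-one-out loop, each held-out well's prediction is scored with this function and the scores are averaged per method to produce the comparison in Figure 8.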