Internet Engineering Jacek Mazurkiewicz, PhD Softcomputing Part 11: SoftComputing Used for Big Data Problems
Agenda Climate Changes Prediction System Based on Weather Big Data Visualisation Natural Language Processing Methods Used for Automatic Prediction Mechanism of Related Phenomenon
Climate Changes Prediction System Based on Weather Big Data Visualisation Antoni Buszta Dolby Poland Sp. z o.o. Jacek Mazurkiewicz Department of Computer Engineering Wrocław University of Technology, POLAND
Goal & Assumptions Goal: - a new approach to weather forecasting, - a prediction process based on processing big data, - turning the processed data into visualizations, - visualization used to enhance forecasting methods based on artificial neural networks. Assumptions: 1. data visualization gives additional interpretation possibilities, 2. it is possible to enhance weather forecasting by data visualization, 3. neural networks can be used for visual weather data analysis, 4. neural networks can be used for climate change prediction.
Classic Weather Data Analysis classic weather forecasting: around 650 BC the Babylonians started to predict the weather from cloud patterns; even early civilizations used recurring events to monitor changes in weather and seasons five main weather forecasting methods: Persistence, Trends, Climatology, Analog, Numerical Weather Prediction
Five Main Weather Forecasting Methods Persistence - the simplest and most primitive method, - assumes conditions will not change during the whole forecast period, - performs well for regions where the weather hardly changes - sunny and dry regions of California in the USA Trends - based on the speed and direction of weather changes, - a simple trend model can be calculated, - atmospheric fronts, location of clouds, precipitation and high and low pressure centers are the main factors, - accurate only for forecasts a few hours ahead, - useful for detecting storms or an upcoming weather improvement
Five Main Weather Forecasting Methods Climatology - based on statistics and long-term accumulated data, - takes into account a wider spectrum of occurring changes, - can provide more accurate local predictions, - the main disadvantage is ignoring extremes, since the whole model output is averaged, - does not adapt to radical climate changes, so it can produce incorrect output year after year Analog - more complicated than the previous ones, - involves collecting data about the current weather and trying to find analogous conditions in the past, - assumes the current situation is the same as in the archival data Numerical Weather Prediction - mathematical models of the atmosphere used to calculate a possible forecast - short- and long-term weather predictions
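The persistence and trend methods reduce to one-line formulas; a minimal sketch (the temperature values are made-up illustration, not data from the paper):

```python
def persistence_forecast(history):
    """Persistence: tomorrow looks like today."""
    return history[-1]

def trend_forecast(history):
    """Trends: extrapolate the most recent change one step ahead."""
    return history[-1] + (history[-1] - history[-2])

temps = [11.0, 12.5, 13.0, 14.2]      # daily temperatures, oldest first
print(persistence_forecast(temps))     # 14.2
print(trend_forecast(temps))           # ~15.4 (14.2 plus the last day's rise)
```

Climatology would instead return a long-term average for the date, and analog forecasting would search an archive for the most similar past sequence.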
Weather Visualization Methods - satellite photo - visualization of pressure using isobars - visualization of temperature using isotherms - visualization of wind using isotachs - visualization of dew point using isodrosotherms
Weather Data Set HadCRUT3 - records of the global temperature, Intergovernmental Panel on Climate Change (IPCC) data designated by the World Meteorological Organization (WMO); contains monthly temperature values for more than 3000 land stations National Climatic Data Center in Asheville, NC, USA - collects global summary of the day (GSOD) reports, which form the Integrated Surface Data (ISD) daily summaries, obtained from the USAF Climatology Center; beginning from 1929, over 9000 land and sea stations are monitored both of these data sources are freely available for non-commercial use daily GSOD reports contain multiple data types which can serve as perfect input to big data processing the data was provided in Imperial units but was converted to the Metric system during later processing
Neural Network Design (1) Configuration 1: seven-day forecast based on temperature only; very similar to analog forecasting - input parameters: temperature; input neurons: 7 - output parameter: temperature; output neurons: 1 Configuration 2: seven-day forecast based on temperature and the number of the day in the year; similar to the previous one, but also considers the transience of seasons by knowing which part of the year it is - input parameters: temperature, day of the year; input neurons: 8 - output parameter: temperature; output neurons: 1 Configuration 3: one-day forecast based on temperature, mean station pressure and wind speed for that day; analog forecasting aware of multiple weather parameters - input parameters: temperature, mean station pressure for the day and mean wind speed; input neurons: 3 - output parameter: temperature; output neurons: 1
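Building training pairs for the first configuration (7 daily temperatures in, the next day's temperature out) can be sketched as a sliding window; the toy series below is illustrative, not taken from the dataset:

```python
def make_windows(temps, window=7):
    """Slide a `window`-day input over the series; the following day is the target."""
    pairs = []
    for i in range(len(temps) - window):
        pairs.append((temps[i:i + window], temps[i + window]))
    return pairs

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
pairs = make_windows(series)
print(pairs[0])  # ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 8.0)
```

The second configuration would append the day-of-year number to each input vector (8 input neurons), and the third would use one day's temperature, pressure and wind speed instead (3 input neurons).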
Neural Network Design (2) - input temperature from the previous two days plus pressure and wind speed from the previous day should provide a one-day temperature forecast, - the algorithm needs to keep a history of three days of measurements: two days are used as neural network input, the last day is used for validation or testing whether the provided answer is correct, - the neural network built for the forecasting application was trained with the most classic training method - backpropagation, - the neural network is not trained until convergence: it is impossible to find an optimal solution and create a model which will perfectly correspond to weather conditions in every situation.
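A minimal backpropagation sketch under assumed details (7 inputs, one small tanh hidden layer, a linear output, plain SGD, a synthetic sine wave as the "temperature" series) - not the authors' exact network or data:

```python
import math
import random

random.seed(0)
N_IN, N_HID = 7, 5
w1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
w2 = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b2 = 0.0

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    return h, sum(w * hi for w, hi in zip(w2, h)) + b2

def train_step(x, target, lr=0.01):
    global b2
    h, y = forward(x)
    err = y - target                              # gradient of squared error
    for j in range(N_HID):
        grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # tanh derivative
        w2[j] -= lr * err * h[j]
        for i in range(N_IN):
            w1[j][i] -= lr * grad_h * x[i]
        b1[j] -= lr * grad_h
    b2 -= lr * err
    return err * err

# Toy series standing in for daily temperatures; 7-day window, next day as target.
series = [math.sin(t / 3.0) for t in range(200)]
data = [(series[i:i + 7], series[i + 7]) for i in range(len(series) - 7)]
first = sum(train_step(x, t) for x, t in data)    # total loss, first epoch
for _ in range(50):
    last = sum(train_step(x, t) for x, t in data)
print(first > last)  # loss decreases over the epochs
```

Note the slides' caveat: training is deliberately stopped early rather than run to convergence, since no model will correspond perfectly to the weather in every situation.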
Learning Benchmarks (1) - 7 days of temperature measurements, one-step-ahead temperature prediction, 7 input neurons and 1 output neuron - three places in the world were chosen to benchmark neural network performance and effectiveness: Opole, Helsinki and Mexico City - the choice was based on how fast the weather changes within one year in each place; moreover, enough meteorological data has been collected in these cities to create a representative dataset and, later, a neural network - Opole is a city in southern Poland on the Oder River (Odra), characterized by moderate winters and summers - Helsinki is the capital and largest city of Finland, located in southern Finland on the shore of the Gulf of Finland, an arm of the Baltic Sea; its continental climate has a lower minimal temperature than the previous city - Mexico City is the capital of Mexico; it has a subtropical highland climate, and due to its high altitude of 2,240 m the temperature rarely falls below 3°C or rises above 30°C
Learning Benchmarks (2)
Climate Change Prediction (1) - the weather prediction targeted the whole year 2013; it was the last year for which a dataset with complete input-to-output relations could be acquired, - the first forecast was created based on a 7-day temperature input; additionally, the neural network knew which part of the year it was, by receiving the number of the day in the year as one of the network inputs, - the model created for the city of Opole used data beginning in 1973 and ending in 2012; the year 2013 was used for testing the forecasting possibilities, - results are better in months where the temperature is closer to the yearly average, - for Helsinki the neural network performed better in summer, when atmospheric conditions were moderate; the algorithms did not correctly detect the temperature drop at the end of February, - the best results were obtained for Mexico City.
Climate Change Prediction (2)
Climate Change Prediction (3)
Climate Change Prediction (4)
Conclusions the results were satisfactory and proved that it is possible to use a neural network for weather forecasting big data visualization, on the other hand, provided easy access to data which would otherwise be hard to process the analysis showed that generative algorithms should be preferred over iterative ones due to better processing speed and lower memory consumption the executed tests and benchmarks, and further analysis, have shown that the developed application creates valid forecasting reports and can be successfully used for predicting multiple weather parameters
Natural Language Processing Methods Used for Automatic Prediction Mechanism of Related Phenomenon Krystian Horecki Nokia Networks, Technology Center Wroclaw, POLAND Jacek Mazurkiewicz Department of Computer Engineering Wrocław University of Technology, POLAND
Goal & Assumptions Goal: an idea to combine a variety of Natural Language Processing techniques with different classification methods as a tool for an automatic prediction mechanism of a related phenomenon the paper proposes new, promising ways of automatic data and content mining for big data systems the presented accuracy results are much better than the average human classification accuracy for sentiment analysis Assumptions: different types of preprocessing techniques are used and verified in order to find the best set of them the approach allows recognizing the phenomenon related to the text the real input from the big data systems - news website articles - is the source of raw text data
Motivation millions of texts on the web; hard to determine the meaning and the reason for a text's existence; it is important to automatically understand what is happening around you many existing text mining solutions (Angoss, SAS and many more) well-known and researched Machine Learning techniques lack of solutions connecting text analysis with a related phenomenon; no existing solutions covering the paper's topic web text analysis: finding the related phenomenon based on text parameters, using Machine Learning and Natural Language Processing techniques
Input Data / Text Filtering Levels (1) real articles from the web, automatically extracted, manually categorized for sentiment analysis, stored in a database after processing; more than 1000 articles Level 1: punctuation and stop words filtering, short words filtering, lemmatisation Level 2: related words filtering Level 3: adjectives-to-nouns transformation
Input Data / Text Filtering Levels (2) Level 1 - filtering mostly focused on the removal of short words, stop words and punctuation marks; includes conversion to lowercase and lemmatisation techniques used to reduce the number of similar words Level 2 - filtering based on semantic tree analysis and removal of similar words according to their neighborhood in the tree; a technique which uses semantic trees to reduce the number of words; it was the authors' idea to combine such extensive related-word merging with text categorization Level 3 - filtering connected with the removal of adjectives and their replacement with corresponding nouns; this method should enhance accuracy in cases such as the words "Polish" and "Poland", so we get the same word twice instead of two different words; the authors' idea was to combine this method with text categorization
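Level 1 filtering can be sketched as a short pipeline; the stop-word set and the lemma table below are tiny hand-written stand-ins for the NLTK corpora the paper actually uses:

```python
import re

STOP_WORDS = {"the", "is", "are", "a", "an", "of", "and", "in"}  # toy stand-in
LEMMAS = {"cities": "city", "was": "be", "growing": "grow"}       # toy stand-in

def level1_filter(text, min_len=3):
    tokens = re.findall(r"[a-z]+", text.lower())        # lowercase, drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word filtering
    tokens = [t for t in tokens if len(t) >= min_len]    # short-word filtering
    return [LEMMAS.get(t, t) for t in tokens]            # lemmatisation stand-in

print(level1_filter("The cities of Poland are growing."))
# ['city', 'poland', 'grow']
```

In the real system the stop-word list and the lemmatiser come from NLTK's corpora rather than hand-written tables.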
Level 2 Filtering Algorithm
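The Level 2 idea - collapsing words that share a nearby ancestor in a semantic tree - can be sketched as follows; the parent map is a toy stand-in for WordNet's hypernym tree, not the paper's actual algorithm:

```python
# Toy hypernym map: each word points at its parent in the semantic tree.
PARENT = {"oak": "tree", "pine": "tree", "tree": "plant",
          "rose": "flower", "flower": "plant"}

def ancestors(word, depth=2):
    """Walk up to `depth` parent links, collecting the chain."""
    chain = [word]
    while word in PARENT and len(chain) <= depth:
        word = PARENT[word]
        chain.append(word)
    return chain

def merge_related(tokens, depth=2):
    """Replace each word with its highest ancestor within `depth` levels."""
    return [ancestors(t, depth)[-1] for t in tokens]

print(merge_related(["oak", "pine", "rose"]))  # ['plant', 'plant', 'plant']
```

Merging "oak", "pine" and "rose" into one token ("plant") is what reduces the word count before categorization.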
Natural Language Processing dividing text into tokens (tokenization), usage of ready-to-use semantic trees, usage of text dictionaries, frequency distribution analysis, lemmatization each of the listed methods should give an additional accuracy enhancement, examined by testing different filtering levels most of them were taken from the NLTK library; some required additional custom implementation ready-to-use dictionaries and corpora which already contain data collected for different purposes; two corpora were used during the implementation: a stop words dictionary and a WordNet dictionary the WordNet corpus was the most important one because it allowed analyzing the relations between examined words; with such a large lexical database of English it was possible to match words having the same meaning but a different form
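Tokenization plus frequency distribution analysis can be sketched with the standard library (a stand-in for NLTK's tokenizers and FreqDist; the sentence is illustrative):

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

text = "Big data needs big tools, and big tools need data."
freq = Counter(tokenize(text))          # frequency distribution over tokens
print(freq.most_common(2))              # [('big', 3), ('data', 2)]
```

The most frequent (or most informative) tokens become the feature words fed to the classifiers later.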
Classification Mechanisms (1) each classifier has a unified interface and can be used separately; the interface contains a learning method, a testing method and a classifying method the learning method receives a training set and trains the classifier object; the classifier instance can be tested with the test method, which receives a test data set and returns the average accuracy over all test set elements; it is also possible to classify a single feature vector the classifiers were used to test whether our additional techniques improved the classification Neural Network Classifier - implemented as a Multilayer Perceptron with custom parameters such as the number of hidden neurons, the number of hidden layers, the number of input and output neurons, the type of the network and the types of the activation functions Max Entropy Classifier - used with the Improved Iterative Scaling algorithm without a Gaussian prior Naive Bayes Classification Algorithm - chosen due to its simplicity and popularity Decision Tree Classifier - maximum depth of 100 and a single n-way branch for each feature
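The unified interface can be sketched as below; the method names (learn / test / classify) follow the slide, while the majority-class model is a placeholder, not one of the four classifiers from the paper:

```python
class MajorityClassifier:
    """Placeholder model exposing the unified learn / test / classify interface."""

    def learn(self, training_set):
        # training_set: list of (feature_dict, label) pairs
        labels = [label for _, label in training_set]
        self.best = max(set(labels), key=labels.count)

    def classify(self, features):
        # classify a single feature vector
        return self.best

    def test(self, test_set):
        # return average accuracy over the whole test set
        hits = sum(self.classify(f) == label for f, label in test_set)
        return hits / len(test_set)

clf = MajorityClassifier()
clf.learn([({"w": 1}, "pos"), ({"w": 2}, "pos"), ({"w": 3}, "neg")])
print(clf.classify({"w": 9}))  # 'pos'
```

Any of the four real classifiers (MLP, Maxent, Naive Bayes, Decision Tree) can be dropped in behind the same three methods.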
Classification Mechanisms (2) the testing of the classifiers was designed and implemented so that each classifier could be tested with different parameters the core of the test mechanism is the definition of a test case, where the user can specify the parameter values later used by the classifiers; such an approach makes it possible to check how different classifiers behave as parameters change testing is done by first shuffling the dataset and then splitting it into two parts; the classifier is trained with the training part of the data set and then tested with the test data set; the average classification accuracy is the result of the classifier testing testing is done for each filtering level; each classifier is tested many times with a shuffled data set, which removes the chance of spurious results and lets the user calculate an average accuracy over those iterations the results of each test case, each classifier and each filtering level are stored in a database
Algorithm for Classifiers Testing
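The testing loop (shuffle, split, train, test, average over repetitions) can be sketched as follows; the 80/20 split ratio, the run count and the inline placeholder model are illustrative assumptions, not taken from the paper:

```python
import random

class ConstantClassifier:
    """Placeholder model so the loop below is runnable."""
    def learn(self, training_set):
        labels = [label for _, label in training_set]
        self.best = max(set(labels), key=labels.count)
    def test(self, test_set):
        return sum(self.best == label for _, label in test_set) / len(test_set)

def average_accuracy(dataset, make_classifier, runs=10, train_frac=0.8, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        data = dataset[:]
        rng.shuffle(data)                     # remove ordering effects
        cut = int(len(data) * train_frac)     # split into train and test parts
        clf = make_classifier()
        clf.learn(data[:cut])
        scores.append(clf.test(data[cut:]))
    return sum(scores) / runs                 # average over the repetitions

data = [({"f": i}, "pos" if i % 3 else "neg") for i in range(30)]
acc = average_accuracy(data, ConstantClassifier)
print(0.0 <= acc <= 1.0)  # True
```

Repeating the shuffle-split-train-test cycle is what removes the chance of a single lucky or unlucky split distorting the result.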
Multilayer Perceptron Classifier the test cases examined the relation between the number of training epochs of the Multilayer Perceptron Classifier and the classification accuracy; the smallest sufficient number of training epochs makes learning time shorter, which means the training is more efficient the number of training epochs was set as a range of values between 1 and 20 the number of features extracted from the data set was set to 100; only the 30 most informative features were selected using the Naive Bayesian Classifier the test for each number of epochs was repeated 15 times in order to get average results the test gave a very important outcome: any number of training epochs bigger than 3 gives proper classification results, which made later tests much shorter the conclusion is that neural networks do not have to be trained with a big number of learning epochs when a big amount of data is used for the training.
Maxent Classifier the test cases examined the relation between the number of training epochs of the Maxent Classifier and the classification accuracy; the number of iterations was examined in order to learn when it becomes sufficient the number of training iterations was set as a range of values between 1 and 20 the number of features extracted from the data set was 100; the 20 most informative features were selected using the Naive Bayesian Classifier the test for each number of epochs was repeated 15 times in order to get average results the Maxent Classifier reached a stable classification accuracy after the 9th iteration; the classification accuracy for the 3rd level of input data filtration was better during early training stages; the results in iterations between 10 and 20 are quite similar for all filtering levels the data set used for testing contained 1039 articles; each article had at least 200 words
Results of test for relation between classification accuracy and number of feature words in case of Naive Bayesian Classifier
Results of test for relation between classification accuracy and number of most informative feature words in case of Naive Bayesian Classifier
Results of test for relation between classification accuracy and number of feature words in the case of Level 3 filtering and different classifiers
Results of test for relation between classification accuracy and number of most informative feature words in the case of Level 3 filtering and different classifiers
Summary it is very important how the raw data is prepared before providing it to a classifier - some classifiers are less sensitive to the information noise contained in unfiltered data the Naive Bayesian Classifier provides decent accuracy with the lowest training set size and training effort promising results were obtained with the Maxent Classifier: since it proved not to be as sensitive to noise as the Naive Bayesian model, the use of additional feature filtering methods and a big number of features can possibly give very good results the Naive Bayesian Classifier with the 3rd input data filtering level and feature filtering is probably the best method for text data classification it was proved that it is possible to build a system that can be trained to recognize a given phenomenon using Natural Language Processing and machine learning techniques we found additional techniques which helped us improve the classification accuracy; using different classification models, we proved that the additional techniques enhanced the classification accuracy for each type of classification model the most useful and efficient classifier seems to be Naive Bayes, since it combines high training speed, the ability to work with small data sets and high classification accuracy it is promising that the methods created by us proved to give an accuracy gain; importantly, the accuracy achieved during automatic classification of the articles is much better than the average human classification accuracy for sentiment analysis we conducted our research on one phenomenon; however, it is theoretically possible to apply the existing methodology and implementation to other phenomena
it would be very good to execute tests against other types of phenomena and check how the classifiers and filtering methods behave in such conditions the suggested future usage of a Support Vector Machine Classifier could give promising results, since this classifier is popular in sentiment analysis and proved to be the best in terms of categorization accuracy further filtering algorithm enhancement is also possible: filtering of the raw text could be extended with additional procedures using additional linguistic elements, such as adjectives, and a more sophisticated search for related words