Multi-source Model of Heterogeneous Data Analysis for Oil Price Forecasting

This article sheds light on the question of whether it is possible to create fairly accurate forecasts of real oil prices. For this purpose, a multi-level machine learning model has been created to analyze several sources of heterogeneous data to predict future prices. The article uses different types of data: market condition data, titles


INTRODUCTION
Crude oil is a natural liquid fossil fuel contained in geological formations under the surface of the earth. It is mainly produced by oil drilling, which is performed after a preliminary study of structural geology and an analysis of sedimentary basins and reservoir characteristics. Crude oil is one of the most important energy resources on earth. It remains the world's leading fuel, accounting for almost a third of world energy consumption. Crude oil prices are determined by many factors and have a great impact on the global environment and economy.
Due to the importance of the resource, scientists, private researchers and business experts have long been trying to understand the principles of oil price formation. On the one hand, it is necessary to weigh the determinants of supply and demand for oil and their subsequent impact on the price of this commodity. On the other hand, there are a number of exogenous factors, for example atypical reductions or increases in production carried out by the Organization of the Petroleum Exporting Countries (OPEC), which generate structural breaks in time series and make them even more difficult to analyze. News shocks, the current interest of traders in the instrument and many other factors also have a significant impact.
In order to solve this problem, more and more machine learning methods have been used recently. Machine learning provides powerful computational tools and algorithms that can learn and make forecasts based on different data. In this article we offer a new approach to crude oil price forecasting based on a new paradigm of machine learning called learning with heterogeneous sets of factors.
The main advantage of the approach is that the forecast model can capture whole sets of changing factors affecting oil prices.
The results of the experiment show that our model achieves the highest accuracy both in terms of the mean square forecasting error and in terms of the directional accuracy ratio at different forecast horizons. The essence of building this model comes down to finding the optimal relationships between different types of data applicable to the analysis and forecasting of oil prices. Then, on the basis of the obtained relationships, a forecast model is set up that can evaluate 3 parameters at once: the general state of markets; the news background for a certain instrument (in this research, oil contract prices); and transaction history data.
This Journal is licensed under a Creative Commons Attribution 4.0 International License

LITERATURE REVIEW
Many researchers have been interested in the possibility of forecasting the prices of financial instruments, as well as in the factors that affect them. This section briefly describes some work on related topics.
In general, the debate over the possibility of predicting the future from past data is unlikely to disappear. At present, researchers have learned quite well how to handle computational errors and adjust predictive accuracy based on them (Wu, 2019). In addition, special algorithms are already being created to solve prediction problems arising in specific subject areas (Oprea, 2020; Mikhaylov et al., 2020). All this supports the validity of the idea of algorithmic simulation of real processes.
Historically, the first works on the issue used classical econometric methods. Thus, Amano (1987) was the first to propose a small-scale model for forecasting the world oil market. Later this topic was developed in numerous works as computing power and econometric approaches steadily improved, up to modern models with AI (Zhao et al., 2019; Denisova et al., 2019; Mikhaylov, 2018a; Mikhaylov, 2018b), where SDAE-B ensemble training is used.
The proposed work also involves some of the most modern techniques, but they are sensitive to data, so it is worth reviewing the different types of information used in forecasting studies. Time series analysis methods are the most widespread and are based on the premise that the past and the future are interrelated. The most common methods of this approach are ARIMA models (George et al., 2015; Nyangarika et al., 2019a; Nyangarika et al., 2019b; Nyangarika et al., 2018; Nie et al., 2020) and LSTM models (Zhiyong et al., 2017).
The analysis of historical transaction data covers price levels, trade volumes, indexes, and various oscillators and indicators. Technical analysis is built on this approach, and many research papers have been devoted to it. Some simply combine basic models to improve accuracy (Sadorsky, 2006). Others use this type of data to build more complex prediction methods, such as nonlinear ANN models (Saeed and Faezeh, 2006), or apply modern processing methods that combine several variations of input data (Xiong et al., 2015; An et al., 2020). However, these works are mostly based on trade data and ignore news factors, institutional changes and other fundamental changes.
Thanks to the development of computing, current algorithms are even capable of evaluating text for economic forecasting (Semiromi et al., 2020; Atkins et al., 2018). The news background clearly has a significant impact on price levels, as shown by many studies and forecasting techniques (Pei-Yi et al., 2020; Mikhaylov, 2020a; Mikhaylov, 2020b). Contributions to financial forecasting were also made by Shynkevich et al. (2016), Mikhaylov (2019), Dooyum et al. (2020) and Gura et al. (2020). Their work made it possible to combine different categories of news and, through machine learning, to identify their different degrees of impact on the achievement of target prices.
The method closest to the proposed approach is the simultaneous evaluation of multiple sources of information. This method implies including the data types described above in one model. For example, Zhang et al. (2017) used correlation matrices of quantitative and qualitative features as the main data structure for machine learning, which significantly improved accuracy.

Model Structure
As noted earlier, this work will propose a model that combines several independent approaches in forecasting and data analysis. Such approaches have been gaining popularity recently (Xu et al., 2020). According to the terminology of the previous section, the approach includes processing of 3 types of data. Schematically, the process of interaction with data and building a model is shown in Figure 1.

Data Set Introduction
As can be seen from the scheme, the data processing levels are structured in horizontal blocks, and the different data types are separated into vertical branches. To avoid the typical problems of models built on homogeneous data, it was decided to use information of different nature and structure (Daniel et al., 2020). For "Title data," thematic news for the last few years was selected, a total of 6304 headlines. "Transaction data" starts from 2008 and includes historical data on oil prices and trading volume in different time frames: 4-h, daily and weekly. Several frames are needed so that the chart can be "unfolded" when training the neural network. The final type, "Market condition data," contains data on individual large companies, price levels of related products (gasoline, fuel oil), and oil industry indices of regions and different countries.
It is also worth noting that each of the 3 types of data has its own horizon of action: "Transaction" is short-term, "Title" is medium-term, and "Market condition" is long-term. This division is justified by the theory of the behavior of economic agents. In real practice, short-term traders, who trade within a horizon of a few minutes to a couple of weeks, mainly rely on the analysis of charts, candlesticks, trade volumes and other similar indicators (Zhongpei and Jun, 2019; An et al., 2020a; An et al., 2020b; An et al., 2019). Medium-term investors most often select companies according to the news background and reports, and their investment horizon is up to a year. Long-term investors, typically institutional investors, assess market interrelationships and the development prospects of entire industries, and their investment horizon is not limited. The data types used in the model have been distributed by the same logic. In addition, the category to which the data processed by the neural network belong acts as an additional weight multiplier that depends on the forecasting horizon (Jianzhou et al., 2020). For short-term forecasts, data of the short- and medium-term categories are more significant, and the same logic is applied to the other horizons (Makumbonori et al., 2019).

Data Pre-processing and Input Data for ANN
After the data are collected, they need to be processed. For the features of the "Market condition" category, it is necessary to establish the existing relationships between different economic instruments. For this purpose, it was decided to create a correlation matrix. It reveals the presence of a linear relationship between the considered indicators, where the calculation is performed according to formula 1:

$$ r_{XY} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2 \sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \qquad (1) $$

where X_i and Y_i are the growth values of the considered pair of indicators in %; \bar{X} and \bar{Y} are their mean values; N is the number of observations of each indicator. Thanks to this transformation, it was possible to create a sparse matrix that includes only indicators whose correlation was stronger than 0.3 in absolute value. Indicators that did not reach the required correlation strength were kept in an additional lightweight matrix in the model to optimize network performance. A part of the original correlation matrix is shown in Table 1.
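The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the indicator names and growth values are hypothetical, and the helper names `pearson` and `split_by_strength` are introduced here only for the example.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def split_by_strength(series, threshold=0.3):
    """Split indicator pairs into strong (|r| > threshold) and weak ones."""
    names = sorted(series)
    strong, weak = {}, {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(series[a], series[b])
            target = strong if abs(r) > threshold else weak
            target[(a, b)] = r
    return strong, weak

# Hypothetical percentage growth rates, for illustration only.
growth = {
    "crude": [1.0, -2.0, 0.5, 3.0, -1.0],
    "gasoline": [0.8, -1.5, 0.7, 2.5, -0.9],
    "unrelated": [0.1, 0.1, -0.2, 0.0, 0.1],
}
strong, weak = split_by_strength(growth)
```

Storing only the pairs above the threshold in `strong` mirrors the sparse matrix, while `weak` plays the role of the additional lightweight matrix.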
As can be seen from the table, even the small demonstrated part of the matrix contains enough features with an acceptable correlation strength. This means that the original hypothesis of a relationship between data in adjacent areas is correct. Due to this, it was possible to obtain a sufficiently stable data set reflecting the general state of the markets of interest.
Headline analysis, or the analysis of "text data of medium-term impact," is one of the key features of the model. Unfortunately, the task of machine understanding of text has not yet been solved, so for the purposes of this work it was necessary to translate the text into valid numeric data (Zhenjing et al., 2016). For this purpose, the news headlines were processed in 3 stages. First, it is necessary to remove the "noise," which for text means function words such as "the," "is," "a," as they do not carry a semantic load. After that, to reduce the number of features, the words themselves were reduced to their initial form, for example, "caring" became "care"; this process is called lemmatization. The final stage of text pre-processing was the compilation of n-grams. These are structures for storing and perceiving textual data in the form of word combinations of 1-3 words that maintain the original order of use.
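The three stages above can be illustrated with a short sketch. It is an assumption-laden toy: the stop-word set and the lemma dictionary are hypothetical stand-ins for whatever linguistic resources were actually used, and the headline is invented for the example.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "in", "on", "as"}

# Toy lemma dictionary (assumption); a real lemmatizer covers the whole vocabulary.
LEMMAS = {"caring": "care", "prices": "price", "cuts": "cut"}

def preprocess(headline):
    """Stage 1 and 2: drop stop words, then reduce words to their initial form."""
    tokens = [t.strip(".,!?:;\"'").lower() for t in headline.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

def ngrams(tokens, max_n=3):
    """Stage 3: all word combinations of 1..max_n words in original order."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = preprocess("OPEC cuts the output as oil prices rise")
grams = ngrams(tokens)
```

Here `tokens` keeps six content words, and `grams` holds the resulting unigrams, bigrams and trigrams.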
After these procedures, the text should be represented as numbers to be submitted to the input of the neural network. Text vectorization is used for this purpose. It implies the creation of numeric vectors corresponding to the previously obtained n-grams. First, a dictionary of combinations is "learned" from the input text data, and then the vector representation of phrases based on contextual proximity is calculated (Xiaodong et al., 2020). In this case, combinations occurring next to each other in the text will have close numeric coordinates in the vector representation. After obtaining this representation, the only thing left is to combine the word vectors with data on price changes, matching the dates of news and price changes. The only nuance of this step was setting the time delay necessary for the market to react to the news. Empirically, the best lag was found to be 12 h, because during this time there was at least 1 h of trading on the most important exchanges, and the model showed the best results in tests (Feuerriegel and Gordon, 2019). For ease of perception, Table 2 shows the most significant single words (n-grams at n=1) from the selected headlines. The adjacent columns show the strength of the connection and its direction, where "+" means that the word is often accompanied by a price increase and "−" means the opposite.
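The alignment of news with price changes under a 12 h lag might look like the sketch below. The exact matching rule is not given in the text, so taking the first quote at or after each moment is an assumption, as are the helper names, the quotes and the headlines.

```python
from datetime import datetime, timedelta

LAG = timedelta(hours=12)  # reaction delay reported in the text

def price_at_or_after(prices, moment):
    """First (timestamp, price) at or after `moment`; prices sorted by time."""
    for ts, price in prices:
        if ts >= moment:
            return ts, price
    return None

def label_headlines(headlines, prices):
    """Pair each headline with the % price change over the lag window."""
    labelled = []
    for published, text in headlines:
        start = price_at_or_after(prices, published)
        end = price_at_or_after(prices, published + LAG)
        if start is None or end is None:
            continue  # not enough price history after the news
        change = (end[1] - start[1]) / start[1] * 100.0
        labelled.append((text, change))
    return labelled

# Hypothetical 4-h quotes and headlines, for illustration only.
base = datetime(2020, 1, 6, 0, 0)
prices = [(base + timedelta(hours=4 * i), p)
          for i, p in enumerate([60.0, 60.6, 60.3, 61.2, 63.0])]
headlines = [
    (base, "OPEC agrees output cut"),
    (base + timedelta(hours=16), "Inventories rise"),
]
labelled = label_headlines(headlines, prices)
```

The second headline is dropped because no quote exists 12 h after its publication, which is one simple way of handling the end of the sample.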
The last category of input data used is "Transaction data." The key tasks in pre-processing these data before submission to the neural network were standardization and increasing the amount of nested data.
Standardization means bringing the data to a common format and representation that ensures their correct use in multivariate analysis. The purpose of standardization is to make observation values correctly comparable. Standardization by the chosen method is carried out by formula 2:

$$ z = \frac{x - u}{s} \qquad (2) $$

where z is the standardized feature; x is the feature value; u is the average value of the processed feature; s is the standard deviation of the feature.
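As a small worked example of this standardization, the sketch below applies z = (x − u) / s to a list of values. Using the population standard deviation is an assumption, since the text does not say which estimator was used; the sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean u and divide by the deviation s."""
    u = statistics.mean(values)
    s = statistics.pstdev(values)  # assumption: population standard deviation
    return [(x - u) / s for x in values]

# Hypothetical feature values with mean 5 and standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation the feature has zero mean and unit deviation, so prices and volumes measured in different units become comparable.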
The task of making it possible to "unfold" data comes from the very concept of regression neural networks. They are excellent approximating algorithms, but are highly prone to overfitting. To work correctly on new data, the algorithm would ideally require the possibility of infinitely decomposing each point and vector into small components. This would yield an unlimited number of patterns describing all possible variations of parameter changes. In real life, this is impossible. The practical approximation proposed here for more correct neural network learning is the nesting of the basic trading time frames. Schematically, the principle is shown in Figure 2.
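One possible reading of the nesting idea is sketched below: each weekly observation is accompanied by the daily and 4-h bars it contains. The figure that defines the actual scheme is not available here, so the block sizes (six 4-h bars per day, five trading days per week) and the sample layout are assumptions made only for illustration.

```python
BARS_PER_DAY = 6    # assumption: six 4-h bars in one trading day
DAYS_PER_WEEK = 5   # assumption: five trading days in one week

def close_of_blocks(closes, size):
    """Closing price of each complete block of `size` consecutive bars."""
    return [closes[i + size - 1]
            for i in range(0, len(closes) - size + 1, size)]

def nest(four_hour_closes):
    """Attach to every weekly close the daily and 4-h closes inside it."""
    daily = close_of_blocks(four_hour_closes, BARS_PER_DAY)
    weekly = close_of_blocks(daily, DAYS_PER_WEEK)
    bars_per_week = BARS_PER_DAY * DAYS_PER_WEEK
    samples = []
    for w, weekly_close in enumerate(weekly):
        d0 = w * DAYS_PER_WEEK
        h0 = w * bars_per_week
        samples.append({
            "weekly_close": weekly_close,
            "daily": daily[d0:d0 + DAYS_PER_WEEK],
            "four_hour": four_hour_closes[h0:h0 + bars_per_week],
        })
    return samples

# Hypothetical 4-h closing prices covering two full weeks.
samples = nest([float(p) for p in range(1, 61)])
```

Each sample then carries the same week at three resolutions, which gives the network more patterns per target value than the weekly series alone.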
Weekly price changes were used to build the forecast directly. The two lower levels served as a "sweep" of the main and auxiliary

ANN Forecasting Model Description
Research in this field uses a wide variety of neural network types. Their regression variants were used for the purposes of this article (Gusev and Burkovskii, 2013; Yumashev et al., 2020).
In general, all neural networks consist of a large number of interconnected elements, neurons, that mimic the neurons of the human brain (Grossi and Buscema, 2008). Figure 3 shows a schematic representation of the structure of this model.
The figure shows that the artificial neuron, like the living neuron, consists of synapses that connect the inputs of the neuron to the nucleus; the nucleus, which processes the input signals; and the axon, which connects the neuron to the next layer. Each synapse has a weight that determines how much the corresponding neuron input affects its state. The state of the neuron is determined by formula 3:

$$ S = \sum_{i=1}^{n} x_i w_i \qquad (3) $$

where n is the number of neuron inputs; x_i is the value of the i-th neuron input; w_i is the weight of the i-th synapse.
The obtained values pass through these operations many times, which is the principle of multilayer networks. Such a structure makes it possible to establish more complex and numerous connections, but adding layers increases both resource consumption and training time, so their number cannot be increased endlessly (Hooman and Ebrahimi, 2020). The model we use is also multilayer.
After internal data processing in the neurons of the hidden layer, the results are fed to the output layer through the activation function. Most often, it is a classical sigmoid calculated by formula 4 (Wen et al., 2019):

$$ f(x) = \frac{1}{1 + e^{-ax}} \qquad (4) $$

where a is a specially calculated coefficient that sets the slope (flatness) of the function; x is the argument submitted to the function.
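The neuron state and the sigmoid activation described above can be combined into a short forward-pass sketch. The function names and the example weights are hypothetical, and the default slope a = 1 is an assumption, since the text does not report the value used.

```python
import math

def neuron_state(inputs, weights):
    """Weighted sum of the inputs: S = sum of x_i * w_i."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(x, a=1.0):
    """Classical sigmoid; `a` controls the slope of the curve."""
    return 1.0 / (1.0 + math.exp(-a * x))

def layer_forward(inputs, weight_rows, a=1.0):
    """Output of one layer: one weight row per neuron."""
    return [sigmoid(neuron_state(inputs, row), a) for row in weight_rows]

# Hypothetical inputs and weights, for illustration only.
outputs = layer_forward([1.0, 2.0], [[0.5, -0.25], [0.0, 0.0]])
```

A multilayer network simply feeds `outputs` into the next call of `layer_forward` with that layer's weights.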
A backpropagation algorithm was used to train the model. In this approach, the error propagates from the output layer to the input layer, i.e. in the direction opposite to that of signal propagation during normal network operation. The learning algorithm itself can be written as a cycle of the following operations:
1. Supply of input data in the form of the required patterns and determination of the network outputs;
2. Calculation of input layer weights by formula 5;
3. Calculation of output layer weights by formula 6;
5. Correction of weights by formula 8;
6. Completion of training if the accuracy condition is satisfied.
In all formulas above: y_i is the output value of the i-th neuron; x_i is the value of the i-th neuron input; S_j is the weighted sum of input signals defined by formula (3); k is the number of neurons in the n+1 layer; d_i is the target value of the i-th output; η is the parameter that defines the learning rate; n and t are the numbers of individual layers; N is the number of layers.
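The training cycle can be illustrated with the textbook form of backpropagation for a network with one hidden layer and sigmoid activations. This is a generic sketch under stated assumptions, not a reproduction of formulas 5-8: the squared-error loss, the absence of bias terms, the starting weights and the training pair are all chosen here for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, w_hidden, w_out):
    """Forward pass: returns hidden-layer and output-layer activations."""
    hidden = [sigmoid(sum(xi * w for xi, w in zip(x, row))) for row in w_hidden]
    output = [sigmoid(sum(h * w for h, w in zip(hidden, row))) for row in w_out]
    return hidden, output

def train_step(x, d, w_hidden, w_out, eta=0.5):
    """One backpropagation step; returns the error before the update."""
    hidden, output = forward(x, w_hidden, w_out)
    # Output-layer error terms: (y - d) times the sigmoid derivative y(1 - y).
    delta_out = [(y - t) * y * (1.0 - y) for y, t in zip(output, d)]
    # Hidden-layer error terms: output errors propagated back through w_out.
    delta_hidden = [
        sum(delta_out[k] * w_out[k][j] for k in range(len(w_out))) * h * (1.0 - h)
        for j, h in enumerate(hidden)
    ]
    # Weight correction with learning rate eta (gradient descent).
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= eta * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= eta * delta_hidden[j] * x[i]
    return 0.5 * sum((y - t) ** 2 for y, t in zip(output, d))

# Hypothetical starting weights and a single training pair.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2, -0.1]]
errors = [train_step([1.0, 0.5], [1.0], w_hidden, w_out) for _ in range(200)]
```

In practice the loop would stop once the error falls below a chosen accuracy threshold, which corresponds to the last step of the cycle.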

RESULTS
In accordance with the principles described earlier, we have developed a working model of oil price analysis and forecasting. Its validity is evaluated by three measures: RMSE (root mean squared prediction error, lower is more accurate), MAPE (mean absolute percentage error, lower is more accurate) and DAR (Directional Accuracy Ratio, higher is more accurate), which are calculated by formulas 9-11:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (9) $$

$$ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (10) $$

$$ DAR = \frac{1}{n} \sum_{i=1}^{n} d_i \qquad (11) $$

In all formulas above: y_i is the real price value at moment i; \hat{y}_i is the forecasted price value at moment i; n is the number of forecast values compared to real data; d_i = 1 if (\hat{y}_i - y_{i-1})(y_i - y_{i-1}) > 0 and d_i = 0 otherwise.
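The three quality measures can be computed as in the sketch below. It follows the standard definitions of RMSE, MAPE and DAR; expressing MAPE in percent and averaging DAR only over the points that have a preceding real price are assumptions, and the price series are hypothetical.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error between real and forecasted prices."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / n

def dar(actual, forecast):
    """Share of points where the forecast and real moves have the same sign.

    Assumption: the first point is skipped because it has no previous price.
    """
    hits = sum(
        1 for i in range(1, len(actual))
        if (forecast[i] - actual[i - 1]) * (actual[i] - actual[i - 1]) > 0
    )
    return hits / (len(actual) - 1)

# Hypothetical real and forecasted prices, for illustration only.
real = [50.0, 52.0, 51.0, 53.0]
predicted = [50.0, 51.0, 52.0, 54.0]
scores = (rmse(real, predicted), mape(real, predicted), dar(real, predicted))
```

RMSE and MAPE measure how far the forecast is from the real price, while DAR only checks whether the direction of the move was predicted correctly.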
The obtained model quality measures are shown in Table 3.
The "Data" column in the table deserves attention. It reflects the relevance of using a large number of heterogeneous data. According to the table, increasing the number of data sources improves the predictive strength of the model. The data were divided into two parts: 80% for training and 20% for testing (Jian et al., 2019). The forecast itself is shown in Figure 4.
The main assumption of the article proved to be correct: it is possible to combine fundamentally different data types and structures in one model. In addition, the model can work with incomplete data. For example, significant news on the subject of interest may not appear every day, but the forecast will still be made. Thus, it was possible to solve several problems of modeling economic indicators using mathematics, statistics and machine learning.
Source: Author's calculations, Thomson Reuters
In particular, the proposed model is almost devoid of the following disadvantages:
• Overfitting. The heterogeneity of indicators and the division into time periods prevent the network from memorizing a limited number of patterns and failing on new data
• Lack of information. In modern realities, the complete absence of 3 different types of data on quite popular topics is hardly possible
• Low applicability. As can be seen from the data presented earlier, the proposed model works quite correctly and accurately in the real conditions of the modern market.

CONCLUSION AND DISCUSSION
The rapid growth of technology and the globalization of financial markets, as well as the strategic importance of commodities such as oil, have increased the need for accurate and effective oil price forecasting. With the rapid changes in economic, political and social conditions in oil-producing and oil-consuming countries, it has become difficult for forecasters to obtain the data necessary for effective forecasts (Nonejad, 2020). Price time series in financial markets have become very difficult to predict, so researchers are looking for forecasting models with less input data and higher accuracy (Minggang et al., 2018).
This study was aimed at offering the best way to solve the identified range of problems. It is specific in that the initial hypothesis of the work is not just that time series of prices for economic instruments can be predicted, but that the best combination of data for this purpose can be found at the same time (Ren, 2020). For this purpose, the price forecasting model takes into account the structural state of markets, the news background of the selected financial instrument and classical analysis of transaction data. In general, the results of the model can be considered quite acceptable, and the research contributes to the literature on financial markets.
In fact, this paper presents an artificial neural network model that solves the problem of determining the most informative connection between different types of data on oil prices. The forecast presented earlier was quite accurate both in classical model evaluation metrics and in direct consideration of forecast data and real data.
The main advantage of this study is that it was possible to take into account much more relevant information and thus better reveal hidden market mechanisms.
One limitation of this work, left for further research, is that the model does not yet take into account an even greater number of economic parameters, such as the links between the actions of individual states or the policies of oil-exporting countries. With further improvement of the algorithms, it will be possible not only to improve the accuracy of price forecasts, but also to better understand the fundamental motives behind the behavior of some agents of the world economy. To a certain extent, it may even contribute to political science, since oil is the world's strategic resource and understanding the structure of its value can shed light on the actions of authorities and large firms.