Abstract

This master thesis describes the application of statistical and artificial intelligence techniques (Bayesian networks) to the monitoring and validation of a data processing system. Data processing involves many process variables, and operators responsible for monitoring, controlling, and diagnosing these processes must analyse current states, validate processes, diagnose process failures, and take appropriate actions to control the processes. This thesis may help those who maintain such complex, critical systems.

The Comprehensive Nuclear-Test-Ban Treaty Organization (CTBTO) has designed a verification regime to detect any event around the world. The International Monitoring System (IMS) of the CTBTO currently consists of around 280 facilities worldwide that monitor the planet for signs of natural and man-made events. These facilities contain seismometers; a seismometer is an instrument that converts ground motion into electric voltage. Different types of seismometers are used at seismic stations. The IMS uses three types of technology to receive continuous data: the seismic network, the hydroacoustic network, and the infrasound network.

The International Data Centre (IDC), located at the headquarters of the CTBTO in Vienna, Austria, continuously receives data from the primary stations in the IMS network. These data are stored in the file system and passed through a number of automatic analysis stages. Detection and Feature Extraction (DFX) and Global Association (GA) are the two most important stages in the data processing pipeline. DFX applications perform a variety of tasks; their primary functions are to make detections and to measure features from waveforms. DFX processes data from all three waveform-based technologies (seismic, hydroacoustic, and infrasonic). GA is the process in the automatic pipeline that forms event hypotheses. GA reads arrival and amplitude data for a time interval and forms sets of associations using an exhaustive search algorithm.
These association sets define the events, which are then located and have their magnitudes estimated.

The data processing pipeline has been subject to various changes and upgrades. The most common changes are adding new stations to the pipeline, reconfiguring existing stations, software changes, operating system upgrades, and processing parameter changes. After such changes, the processing results must be validated, and statistical methods such as null hypothesis testing can be used for this purpose.

Hypothesis testing is a method for testing a claim about a parameter in a population, using data measured in a sample. In this method, we test a hypothesis by determining the likelihood that a sample statistic could have been obtained if the hypothesis regarding the population parameter were true. To begin, we identify a hypothesis or claim that we feel should be tested. For example, after a seismic station is reconfigured, we might want to test the mean number of detections per hour. The expected value is estimated from previous years' data. Then, we select a criterion by which we decide whether the claim being tested is true. For example, the claim may be that the station makes more than N detections per hour. If this claim is true, most of the samples we select should have a mean close to or above N detections per hour. We select random samples from the population and measure their sample means. We then compare what we observe in the samples with what we would expect to observe if the claim were true. The null hypothesis (H0) is a statement about a population parameter, such as the population mean, that is assumed to be true. The null hypothesis is a starting point: we test whether the value stated in the null hypothesis is likely to be true.
An alternative hypothesis (HA) is a statement that directly contradicts the null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis states what we think is wrong about the null hypothesis.

The level of significance is the criterion of judgment upon which a decision is made regarding the value stated in the null hypothesis. The criterion is based on the probability of obtaining a statistic measured in a sample if the value stated in the null hypothesis were true. In behavioural science, the level of significance is typically set at 5%. When the probability of obtaining a sample mean would be less than 5% if the null hypothesis were true, we conclude that the sample we selected is too unlikely, and so we reject the null hypothesis. In this thesis, the level of significance is set to 5% for events such as station parameter changes, because a station's detections depend on the seismicity of the earth during the testing period. For software updates, a significance level of 1% may be used, because the test compares the same data in two environments (testing and runtime).

When we decide to reject the null hypothesis, we can be correct or incorrect. The incorrect decision is to reject a true null hypothesis. This decision is a Type I error; the probability of committing it is the level of significance, which researchers control directly. When we decide to retain the null hypothesis, the correct decision is to retain a true null hypothesis, and the incorrect decision is to retain a false one. The latter decision is a Type II error: retaining a null hypothesis that is actually false.

The artificial intelligence part of the thesis explains Bayesian networks.
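Before turning to Bayesian networks, the hypothesis-testing procedure described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the procedure used in the thesis: it assumes a two-sided test of H0 "mean detections per hour equals N" against HA "mean differs from N", uses a normal approximation (reasonable only for larger samples), and works on invented hourly counts. The function name `two_sided_z_test`, the sample values, and the baseline value are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sided_z_test(sample, mu0, alpha=0.05):
    """Test H0: population mean == mu0 against HA: population mean != mu0.

    Uses the normal approximation; for small samples a t-test would be
    the more usual choice.
    Returns (z statistic, p-value, True if H0 is rejected at level alpha).
    """
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = (mean(sample) - mu0) / standard_error
    # Probability of a statistic at least this extreme if H0 were true.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical hourly detection counts after a station reconfiguration
# (illustrative values only, not IMS data).
detections_per_hour = [14, 11, 13, 12, 15, 10, 12, 13, 11, 14, 12, 13,
                       15, 12, 11, 13, 14, 12, 10, 13, 12, 14, 11, 13,
                       12, 15, 13, 11, 12, 14, 13, 12, 11, 13, 14, 12]
baseline_mean = 12.0  # hypothetical N estimated from previous years' data

z, p, reject = two_sided_z_test(detections_per_hour, baseline_mean, alpha=0.05)
print(f"z = {z:.2f}, p = {p:.3f}, reject H0: {reject}")
```

Passing `alpha=0.01` instead would correspond to the stricter level the abstract suggests for software updates, where the same data are processed in two environments.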
AI systems must cope with uncertainty: they have to deal with incomplete evidence leading to beliefs that fall short of knowledge, with fallible conclusions, and with the need to recover from error, which is called nonmonotonic reasoning. Nevertheless, the AI community has been slow to recognize that any serious, general-purpose AI will need to be able to reason probabilistically, which we call here Bayesian reasoning. Bayesian reasoning is a kind of probabilistic reasoning.

Probability is of two main types. Prior probability, also known as unconditional probability, is the probability assigned to an event in the absence of any knowledge supporting its occurrence or absence. Posterior probability, also known as conditional probability, is the probability of an event after evidence has been observed, that is, when some evidence supporting or negating the outcome is known.

Bayesian networks have been used in the development of many diagnostic systems. They have been applied successfully in a variety of areas such as machine diagnosis, robotics, data mining, natural language interpretation, and planning. A Bayesian network is used to model a domain that contains uncertainty. It is a graphical model of the probabilistic relationships among a set of variables, and it takes the form of a directed acyclic graph (DAG) in which the nodes represent the random variables of interest and the links represent informational or causal dependencies among the variables. Each node holds the states of its random variable and has an associated conditional probability table. The conditional probability table of a node contains the probabilities of the node being in a specific state given the states of its parents. Furthermore, the edges represent cause-effect relations within the domain. These effects are normally not completely deterministic (e.g. disease -> symptom).
The strength of an effect is modelled as a probability.

The thesis explains how to use Bayesian networks as a diagnostic support tool for DFX and GA application failures. In the Bayesian network for DFX processing failures, three intermediate cause nodes were identified. (1) Parameter Change Errors: The DFX application uses application parameter files, and the values in these files are modified as a result of software changes and SHI station changes. The DFX application also needs station-related processing parameters. IMS stations are subject to various upgrades, which require updates to the station-specific values in the station parameter files, and these updates may contain errors. When new SHI stations are installed, new station-specific files must be generated and existing shared parameter files must be updated. (2) Corrupt Data: SHI stations may send corrupted data to the IDC because of failures at the stations or data transmission problems. DFX processing may fail while processing such corrupt data. (3) Software Issues: Installing DFX application patches, upgrading the operating system, and upgrading the middleware (Distributed Transaction Processing) are activities applied to the processing pipeline. These activities can cause DFX processing failures.

Two main evidence nodes were identified. (1) Failed Interval: DFX processing runs on 10-minute intervals of station data. When there is an error in DFX processing, the status of the processing interval is set to Failed. These failed intervals can be observed in a special workflow. (2) No Detections: The primary functions of DFX processing are to make detections and to measure features from waveforms. If DFX processing fails, there will be no detections. The number of detections for each station or each interval can be checked by looking at the database entries.

In Bayesian networks, there are four common types of inference. (1) Backward inference is also called diagnostic inference (from effects to causes).
This inference reasons from symptoms to causes by propagating the evidence backward. (2) Forward inference is also called predictive inference (from causes to effects). This inference reasons from new information about causes to new beliefs about effects by propagating the evidence forward. (3) Intercausal inference is also called explaining away (between parallel variables). This inference reasons about the mutual causes of a common effect (or the mutual effects of a common cause). (4) Mixed inference is also called combined inference. In complex Bayesian networks, the reasoning does not fit neatly into one of the types described above; some inferences are a combination of several types of reasoning. The thesis explains these calculations using a simple version of the Bayesian network for DFX failures.

Complex Bayesian network models may be singly connected, multiply connected, or even looped. To solve a more complex network, it is necessary to simplify it into a polytree. Polytrees have at most one path between any pair of nodes; hence they are also referred to as singly connected networks. As its name suggests, the polytree algorithm can be applied to polytrees. It is significant for a number of reasons: (1) it is one of the first algorithms for inference in Bayesian networks; (2) it gives a cognitive dimension to its computations, as it can attribute a specific probabilistic meaning to each of the sub-computations it performs; (3) it is the basis for a more general class of algorithms, known as conditioning algorithms, which apply to arbitrary Bayesian networks; and (4) it is the basis of an influential class of algorithms for approximate inference in Bayesian networks.
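The inference types described above can be illustrated on a reduced version of the DFX failure network. The sketch below is a simplified example under stated assumptions, not the model or the algorithm from the thesis: it keeps only two cause nodes (parameter error, corrupt data) and one evidence node (failed interval), all prior and conditional probabilities are invented for illustration, and it answers queries by brute-force enumeration of the joint distribution rather than by the polytree algorithm. The names `P_PARAM`, `P_CORRUPT`, `P_FAIL`, `joint`, and `query` are hypothetical.

```python
from itertools import product

# Hypothetical prior probabilities of the two causes (illustrative only).
P_PARAM = 0.05    # P(parameter error = true)
P_CORRUPT = 0.10  # P(corrupt data = true)

# Hypothetical conditional probability table:
# P(failed interval = true | parameter error, corrupt data)
P_FAIL = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.01,
}

NAMES = ("param", "corrupt", "failed")


def joint(param, corrupt, failed):
    """Joint probability of one full assignment, factored along the DAG."""
    p = P_PARAM if param else 1.0 - P_PARAM
    p *= P_CORRUPT if corrupt else 1.0 - P_CORRUPT
    p_fail = P_FAIL[(param, corrupt)]
    return p * (p_fail if failed else 1.0 - p_fail)


def query(target, evidence):
    """Return P(target = true | evidence) by enumerating all assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        # Skip assignments that contradict the observed evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


prior = query("corrupt", {})
predictive = query("failed", {"corrupt": True})    # cause -> effect
diagnostic = query("corrupt", {"failed": True})    # effect -> cause
intercausal = query("corrupt", {"failed": True, "param": True})

print(f"P(corrupt)                 = {prior:.3f}")
print(f"P(failed | corrupt)        = {predictive:.3f}")
print(f"P(corrupt | failed)        = {diagnostic:.3f}")
print(f"P(corrupt | failed, param) = {intercausal:.3f}")
```

With these illustrative numbers, observing a failed interval raises the belief in corrupt data well above its prior, and additionally learning that a parameter error is present lowers it again, which is the explaining-away effect. Enumeration grows exponentially with the number of nodes, which is why structured algorithms such as the polytree algorithm matter for larger networks.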
