Pre-Analysis Data Screening

Use the career-a.sav data file. It guides you through the chapter relatively easily. Remember that the page numbers quoted below may differ depending on your textbook’s edition.

Chapter 3, Pre-Analysis Data Screening, is one of the most important chapters in the text. Data screening is a prerequisite for every statistical analysis in the ensuing chapters, so the chapter must be read and understood. The chapter’s title gives a clear idea of what to expect. By literary analogy, it is the statistical version of “metacognition.” Instead of becoming aware of and understanding one’s own thought processes, in “statistical metacognition” we examine the raw data before we actually conduct the analysis. A statistics professor, Dr. McCormick, compared pre-analysis data screening to a beauty contest. He said pre-analysis data screening is like the preliminary round of a beauty pageant: “Before they choose the top 10 contestants they go through all 50 contestants and ask questions to eliminate those that do not meet up to standards and meet up to set criteria.” (McCormick, 1998) Similarly, a researcher must consider several important issues before subjecting data to multivariate analysis. These issues must be carefully considered, analyzed, and addressed prior to the actual statistical analysis; only after these quality assurance issues have been examined can the researcher be confident that the main analysis will be an honest one, which will eventually result in sound conclusions derived from the data. To better understand how to screen data and why it is done, go to the website, then eResource, and download career-a.sav. It guides you through all the practical SPSS aspects of the chapter relatively easily.

Why Screen Data?

There are four main purposes for screening data prior to conducting a statistical analysis.

The first purpose deals with the accuracy of the data collected. The results of any statistical analysis are only as valid and reliable as the data analyzed. If inaccurate data are used, the researcher will not be able to tell how valid the results are simply by examining the output. Unknown to the researcher, the conclusions will be flawed because they are based on the analysis of inaccurate data. If the data set is small, it is relatively simple for the researcher to proofread the entered data against the raw data. This can be accomplished by using the SPSS List procedure, and it is a very thorough way to check for accuracy. However, if the data set is rather large, this process would be overwhelming. In that case, examining the data using frequency distributions and descriptive statistics is a more efficient method. These can be obtained using the SPSS Frequencies procedure. (Both procedures are carefully explained on page 36 of the text and outlined in Figures 3.41 and 3.42 on pages 63 through 66.)
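The accuracy check described above can also be sketched outside SPSS. The following is a minimal Python illustration, not the textbook's procedure: the scores, the 1-to-5 valid range, and the function names are all assumptions made up for the example. It builds a frequency table and lists any entry that falls outside the valid range.

```python
from collections import Counter
from statistics import mean

# Hypothetical item scored on a 1-5 scale; 55 is a deliberate keying error.
scores = [3, 4, 5, 2, 4, 55, 1, 3, 4, 5]
VALID_MIN, VALID_MAX = 1, 5  # assumed valid range for this item

def frequency_table(values):
    """Return (value, count) pairs sorted by value."""
    return sorted(Counter(values).items())

def out_of_range(values, low, high):
    """Return (position, value) pairs that fall outside the valid range."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]

# Print the frequency distribution and a few descriptive statistics.
for value, count in frequency_table(scores):
    print(f"{value:>5} {count:>3}")
print("min:", min(scores), "max:", max(scores), "mean:", round(mean(scores), 2))
print("suspect entries:", out_of_range(scores, VALID_MIN, VALID_MAX))
```

An impossible value such as 55 stands out in the frequency table and in the maximum, which is the same signal one would look for in SPSS frequency output.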
The second purpose deals with missing data and efforts to assess the effects of incomplete data.
Missing data may occur because of 1) equipment failure, 2) incomplete responses by participants, or 3) data entry errors. Even though such errors do occur in research, many researchers fail to appreciate how seriously they can affect the overall outcome. The best thing to do when a data set includes missing data is to examine it. Using the data that are available, a researcher should conduct tests to see whether patterns exist in the missing data. As shown using SPSS on pages 38 through 42 of the text, there are several alternatives for handling missing data. The career-a.sav file from the website lets you practice specific examples of each.

The first of these alternatives involves deleting the cases or variables that have created the problems. Any case that has a missing value is simply deleted from the data file. If only a few of the cases have missing values, this is a plausible alternative. However, when such cases are numerous, another alternative may be considered. If the missing values are concentrated in only a few variables, then deleting those variables from the data file may be an option, provided that doing so does not affect the overall analysis and results. (Refer to page 38 for an example of how to do this in SPSS. Refer to page 63 and use the missing data examination process to conduct a practical test of missing data.)
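The two deletion options can be compared with a short sketch. This is a minimal Python illustration under assumed data: the variable names and values are hypothetical, and `None` stands in for a missing value.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"age": 34, "income": 52000, "satisfaction": 4},
    {"age": 29, "income": None, "satisfaction": 3},
    {"age": 41, "income": 61000, "satisfaction": None},
    {"age": 38, "income": None, "satisfaction": 5},
    {"age": 50, "income": 72000, "satisfaction": 2},
]

def missing_counts(rows):
    """Count missing values for each variable."""
    return {var: sum(row[var] is None for row in rows) for var in rows[0]}

def listwise_delete(rows):
    """Keep only the cases that have no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_variable(rows, var):
    """Remove one variable from every case."""
    return [{k: v for k, v in row.items() if k != var} for row in rows]

print(missing_counts(cases))
print("cases left after deleting incomplete cases:", len(listwise_delete(cases)))
print("cases left after dropping 'income' first:",
      len(listwise_delete(drop_variable(cases, "income"))))
```

In this made-up file, deleting incomplete cases discards three of five cases, while dropping the one variable that holds most of the missing values keeps four. Which trade-off is acceptable depends on how important that variable is to the analysis.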

A second alternative for handling missing data is to estimate the missing values and then use these estimates during the main analysis. There are several main methods of estimating missing values. The first is for the researcher to use prior knowledge, or a well-educated guess, to supply a replacement value. This method should be used only when the researcher has been working in the specific research area for some time and is very familiar with the variables and the population being studied. Another method of estimating missing values involves calculating the mean of the available data for each variable with missing values and using it as the replacement. (Pages 38, 39, and 59 of the text describe five SPSS options in which means can be used.)
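Mean replacement is simple enough to sketch in a few lines. The following Python example is illustrative only: the data and the function name `mean_impute` are assumptions, and it shows just the overall-mean option rather than all of the SPSS choices.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace each missing value (None) with the mean of the available values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical weekly hours worked, with two missing responses.
hours = [40, 35, None, 45, None, 40]
filled = mean_impute(hours)
print(filled)

# The mean is unchanged, but the variance shrinks because the filled-in
# cases sit exactly at the mean.
observed = [v for v in hours if v is not None]
print(variance(observed), variance(filled))
```

The second print line shows one known side effect of this method: the variable's variance is reduced, which is worth keeping in mind when many values are replaced.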

Finally, a third alternative for handling missing data is to use a regression approach (usually used for ungrouped data). In regression, several independent variables are used to develop an equation that can be used to predict the value of a dependent variable. Here the variable with missing values serves as the dependent variable, and the cases with complete data are used to build the equation. An advantage of this procedure is that it is more objective than a researcher’s guess (as in the previous alternative). Regression also uses more information than simply inserting the overall mean. A disadvantage is that the predicted scores are not fully realistic: because they are derived from the other variables, they may fit those variables better than real scores would. Another disadvantage of regression is that the independent variables must be good predictors of the dependent variable in order for the estimated values to be accurate.
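The regression approach can be sketched with a single predictor. This is a simplified Python illustration, not the SPSS procedure: the text describes several independent variables, whereas this sketch assumes one, and the variable names and values are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_impute(x, y):
    """Fill missing y values (None) with predictions built from complete cases."""
    complete = [(a, b) for a, b in zip(x, y) if b is not None]
    slope, intercept = fit_line([a for a, _ in complete],
                                [b for _, b in complete])
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

# Hypothetical data: salary (in thousands) is missing for two cases.
years_experience = [1, 2, 3, 4, 5, 6]
salary_k = [32, 34, None, 38, 40, None]
print(regression_impute(years_experience, salary_k))
```

Because the filled-in values fall exactly on the fitted line, they agree with the predictor more closely than real scores typically would, which is the disadvantage noted above.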

The third purpose deals with assessing the effects of extreme values, known as “outliers,” on the analysis. There are three fundamental causes of outliers: 1) data-entry errors were made by the researcher, 2) the subject is not a member of the population for which the sample is intended, or 3) the subject is simply different from the remainder of the sample (Tabachnick & Fidell, 2007). The problem with outliers is that they can distort the results of a statistical test. This is due largely to the fact that many statistical procedures rely on squared deviations from the mean (Aron, Aron & Coups, 2006), which makes statistical tests quite sensitive to outliers. A single outlier, if extreme enough, can make the result of a statistical test significant when it otherwise would not be, or nonsignificant when it otherwise would be, and can seriously affect the values of correlation coefficients.

Outliers can exist in both univariate and multivariate situations, among dichotomous and continuous variables, and among independent variables as well as dependent variables (Tabachnick & Fidell, 2007). Univariate outliers are cases with extreme values on one variable; multivariate outliers are cases with unusual combinations of scores on two or more variables. With data sets consisting of a small number of variables, detection of univariate outliers can be relatively simple. It can be accomplished by visually inspecting the data, either by examining a frequency distribution or by obtaining a histogram or stem-and-leaf plot in SPSS and looking for unusual values. One would simply look for values that appear far from the others in the data set. (Refer to pages 39 through 44 for the SPSS procedures for outliers.)

Univariate outliers can also be detected through statistical methods by standardizing all raw scores in the distribution. This is most easily accomplished by transforming the data to z-scores. If a normal distribution is assumed, approximately 99% of the scores will lie within three standard deviations of the mean, so scores with z-values beyond about ±3 are potential outliers. Univariate outliers can also be detected using graphical methods (Tabachnick & Fidell, 2007), as seen in Figure 3.2 on page 31 of the text (see also pages 39 through 44 and page 59). Multivariate outliers are more subtle and therefore more difficult to identify, especially with any of the previously mentioned techniques. The procedure used to detect them is known as Mahalanobis distance. (Use the hyperlink to read more about it. Quite interesting!)
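Both detection ideas can be sketched in Python. This is an illustration under assumptions, not the SPSS procedure: the data are made up, the function names are hypothetical, and the Mahalanobis calculation is written out only for the two-variable case.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize raw scores using the sample mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(x)
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # d' S^-1 d written out for a 2 x 2 covariance matrix S.
    return [(syy * a * a - 2 * sxy * a * b + sxx * b * b) / det
            for a, b in zip(dx, dy)]

# Hypothetical test scores: 120 is a univariate outlier.
test_scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
               50, 52, 48, 50, 51, 49, 50, 52, 48, 120]
flagged = [v for v, z in zip(test_scores, z_scores(test_scores)) if abs(z) > 3]
print("univariate outliers:", flagged)

# Hypothetical pairs: the last case (7, 2) is ordinary on each variable
# separately but is an unusual combination.
x = [1, 2, 3, 4, 5, 6, 7, 8, 7]
y = [1, 2, 3, 4, 5, 6, 7, 8, 2]
d2 = mahalanobis_sq(x, y)
print("case with largest distance:", d2.index(max(d2)))
```

In the second data set no z-score on either variable exceeds 3, yet the last case has by far the largest Mahalanobis distance. That is the sense in which multivariate outliers escape the univariate checks.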

The fourth purpose of screening data is to assess the adequacy of fit between the data and the assumptions of the specific procedure. Some multivariate procedures have unique assumptions upon which they are based. It is noteworthy, though, that most, if not all, techniques share three basic assumptions: normality, linearity, and homoscedasticity. Checking these assumptions helps the researcher judge the robustness of the analysis. (Refer to pages 45 through 52 and page 59 for the SPSS procedures for normality, linearity, and homoscedasticity.)

(1) The first of these assumptions is that of a normal sample distribution. Prior to examining multivariate normality, one should first assess univariate normality. Univariate normality refers to the extent to which all observations in the sample for a given variable are distributed normally. Graphical methods, such as plots of the distribution, are usually the most appropriate starting point and give an indication of whether normality might be violated. Among the statistical options for assessing univariate normality are the values of skewness and kurtosis. A variable can have significant skewness, significant kurtosis, or both. Another statistical test used to assess univariate normality is the Kolmogorov-Smirnov statistic, with Lilliefors significance level. This statistic tests the null hypothesis that the population is normally distributed. A rejection of this null hypothesis, based on the value of the statistic and the associated observed significance level, serves as an indication that the variable is not normally distributed.

Multivariate normality refers to the extent to which all observations in the sample for all combinations of variables are distributed normally. As with the univariate examination, there are several ways, both graphical and statistical, to assess multivariate normality. Since univariate normality is a necessary condition for multivariate normality, it is recommended that all variables be assessed based on their values for skewness and kurtosis.

If the researcher determines that the data deviate substantially from normal, then he or she can consider transforming the data. Data transformations involve the application of mathematical procedures to the data in order to make them appear more normal. Once the data have been transformed, and provided all other assumptions have been met, the results of the statistical analyses will be more accurate. After transforming the data, it is vital to reevaluate the normality assumption.
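Skewness, kurtosis, and the effect of a transformation can be illustrated with a short Python sketch. The formulas below are the simple moment-based versions, which is an assumption of this example; SPSS reports sample-adjusted values that differ somewhat in small samples. The income data and the choice of a base-10 log transformation are also hypothetical.

```python
import math

def central_moment(values, k):
    """k-th central moment, dividing by n."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical incomes (in thousands) with a long right tail.
incomes = [20, 22, 25, 27, 30, 33, 38, 45, 60, 150]
logged = [math.log10(v) for v in incomes]

print("skewness before:", round(skewness(incomes), 2))
print("skewness after log transformation:", round(skewness(logged), 2))
print("excess kurtosis before:", round(excess_kurtosis(incomes), 2))
```

The log transformation pulls in the long right tail and reduces the skewness, although in this made-up sample it does not remove it entirely. That is why normality has to be reassessed after any transformation.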

(2) The second assumption, linearity, presupposes that there is a straight-line relationship between two variables. These two variables can be individual raw data variables or combinations of several raw data variables. The assumption of linearity is important in multivariate analyses because many of the analysis techniques are based on linear combinations of variables. There are essentially two methods of assessing the extent to which the assumption of linearity is supported by the data. In the first method, used in analyses that involve predicted variables, nonlinearity is determined through the examination of residual plots, that is, plots of the prediction errors. The second method of assessing linearity is the inspection of bivariate scatterplots. If both variables are normally distributed and linearly related, the shape of the scatterplot will be elliptical. If one of the variables is not normally distributed, the relationship will not be linear, and the scatterplot between the two variables will not be oval-shaped. Assessing linearity by means of bivariate scatterplots is an extremely lengthy procedure, at best. The process can become even more cumbersome when data sets with numerous variables are being examined.
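The residual method can be sketched numerically. In this minimal Python illustration the data are artificial and the function name is hypothetical; in practice one would inspect a residual plot rather than a list of numbers.

```python
from statistics import mean

def residuals_from_line(x, y):
    """Prediction errors left over after fitting a least-squares straight line."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

x = [0, 1, 2, 3, 4, 5, 6]
linear_y = [2 * a + 1 for a in x]   # artificial straight-line relationship
curved_y = [a ** 2 for a in x]      # artificial curved relationship

print("linear:", [round(r, 2) for r in residuals_from_line(x, linear_y)])
print("curved:", [round(r, 2) for r in residuals_from_line(x, curved_y)])
```

For the straight-line data the residuals are all zero. For the curved data they are positive at both ends and negative in the middle, and a systematic pattern of this kind is what signals nonlinearity in a residual plot.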

(3) The final assumption is the assumption of homoscedasticity. Homoscedasticity is the assumption that the variability in scores for one continuous variable is roughly the same at all values of another continuous variable. This concept is analogous to the univariate assumption of homogeneity of variance. In the univariate case, homogeneity of variance is assessed statistically with Levene’s test. This statistic provides a test of the hypothesis that the samples come from populations with the same variances.
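The logic of Levene's test can be sketched in Python. The sketch below computes the mean-centered form of the W statistic, which is an assumption about the variant being used; the groups are hypothetical, and no significance level is computed because the standard library has no F distribution.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W statistic based on absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - mean of group i|
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = mean([v for zi in z for v in zi])
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical groups: the first two have the same spread, the third does not.
low_spread = [10, 12, 11, 13, 9]
shifted = [15, 17, 16, 18, 14]
high_spread = [1, 21, 11, 25, -3]

print("same spread:", round(levene_w(low_spread, shifted), 3))
print("different spread:", round(levene_w(low_spread, high_spread), 3))
```

W is zero when the groups differ only in their means and grows as their spreads diverge. To reach a decision, W is compared with the F distribution with k - 1 and N - k degrees of freedom, which SPSS does automatically.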

While the screening issues are numerous and a bit cumbersome for the researcher, it is imperative to screen the data before the main statistical analyses are run. SPSS offers a detailed step-by-step process, noted on pages 63 through 66, that aids in data screening. Using these procedures helps the researcher to be confident that the main analysis will be an honest one and that sound conclusions can be derived from the data.

Discussion Questions: You must answer all discussion questions, always. Compare your answers to your peers’ and discuss and correct any differences (all chapters).

(1) What is the difference between univariate analyses and multivariate analyses in data screening?

(2) Why do missing values occur? Why would missing data be a problem? How do you identify missing values using the SPSS procedure?

(3) What is robustness? How would one assess normality, linearity, and homoscedasticity? If the data represent a non-normal distribution, what alternate procedures should researchers take?

(4) Below is a list of SPSS procedures mentioned in this chapter for testing or transforming data. For each item, indicate the page number in the textbook (and learn them!).

Textbook Edition: ( )

How to recode variables as missing values ( )
How to detect missing values ( )
How to replace missing values ( )
How to identify outliers ( )
How to replace outliers with accepted minimum or maximum value by conducting Recode ( )
How to transform a variable using Compute ( )
How to examine normality for quantitative variable within each group ( )
How to examine homogeneity of variances between/among groups ( )
How to examine quantitative variables together by group for outliers ( )
How to examine normality and linearity of variable combinations by group ( )
How to examine homogeneity of variance-covariance between/among groups ( )
How to replace missing values with estimated values by conducting Transform ( )
How to examine outliers for quantitative variable within each group ( )
How to conduct Regression to test Mahalanobis distance ( )
How to replace a small to moderate number of outliers with accepted minimum value by conducting Recode ( )
How to examine normality for quantitative variable within each group ( )
How to transform a variable using Compute ( )
How to examine quantitative variables together for outliers ( )
How to examine normality and linearity of variable combinations ( )
How to examine standardized residuals to predict values ( )

Univariate Example with Ungrouped Data
Missing Data and Outliers ( )
Linear Regression ( )
Normality, Linearity, and Homoscedasticity ( )
Chart Builder ( )
Linear Regression: Plots ( )

Grouped Data for Multivariate Analysis
Missing Data and Outliers ( )
Normality, Linearity, and Homoscedasticity ( )
Scatterplot Matrix ( )
Multivariate Options ( )

Sample Solution