Posted: September 18th, 2017
Introduction to Bivariate Regression Analysis
In the previous lesson, we introduced the idea of measuring the strength and direction of relationships between two nominal or ordinal variables in cross-tabulation tables using lambda, Cramer’s V, gamma and Kendall’s tau-b. This lesson focuses on measuring the strength and direction of relationships between two interval-ratio variables using bivariate regression and correlation. We will also examine multiple regression, a technique for measuring the strength and direction of relationships among two or more interval-ratio independent variables and one interval-ratio dependent variable. Multiple regression analysis is probably the most widely used statistical analysis technique in social science and criminal justice research. Without understanding multiple regression analysis, it is difficult to read and understand professional criminal justice research.
Upon completing this lesson you will be able to:
So far, we have covered two or three chapters each Module. This Module we will focus on just one chapter. The first lesson focuses on the material on regression. The second lesson, focusing on the correlation coefficient and introducing multiple regression, is substantially shorter than Lesson 1.
Consider the different possibilities for measuring relationships between variables measured at different levels. For simplicity we will focus on nominal and interval variables. The table below presents four cells. Each cell corresponds to the level of measurement of independent and dependent variables.
The top left cell represents the situation in which you have both independent and dependent variables measured at the nominal level. As you learned in the previous lesson, in this case you use lambda.
The top right-hand cell represents the situation in which you have interval-ratio independent variables and a nominal dependent variable. The most widely used technique for this situation, logistic regression, is beyond the scope of this course. What about the situation in which you have a nominal independent variable and an interval-ratio dependent variable? Chapter 12 in the text covers analysis of variance (ANOVA). We will not have time to cover this topic here, but if you are interested, you should be able to learn the procedure on your own. When you have interval-ratio level independent and dependent variables, you use regression analysis, the topic of this lesson. We will also examine correlation, which is closely related to regression. The kinds of questions that we use regression and correlation to answer can be illustrated by general examples.
We can also state specific research questions that we use regression and correlation to investigate.
We could use regression and correlation to test a criminal justice-related research question:
Does increasing the number of police in a city lead to lower rates of violent crime?
Read Chapter 13 in your textbook before continuing with this lesson. Skip the section on “Testing the Significance of R²…” on pages 441-443. However, be sure you do not skip the beginning of the section on “Pearson’s Correlation Coefficient (r)” at the bottom of page 443. By the way, there are lots of numbers and what appear to be daunting formulas in this chapter, but try not to worry too much. The underlying ideas are straightforward and the formulas are not as difficult as they appear at first sight.
In bivariate regression analysis, we often start by constructing or generating a scatter diagram (or scatterplot) to see whether the two variables are related to each other.
|Scatter Diagrams (Scatterplot)|
|A visual method used to display a relationship between two interval-ratio variables.|
|Typically, the independent variable is placed on the X-axis (horizontal axis), while the dependent variable is placed on the Y-axis (vertical axis).|
GNP, Gross National Product, is a measure of a nation’s wealth. Some researchers hypothesize that as GNP increases, the percentage of a country’s population willing to pay more to protect the environment will increase. The International Social Survey Programme collected data on 16 countries in its 2000 survey. Do the data seem to be consistent with this hypothesis? Place the values for each country in the scatterplot by dragging the name of the country to the appropriate place. Check your answer with the professionally drawn scatterplot.
|Country||GNP per Capita||Percentage Willing to Pay Higher Taxes|
|Adapted from Frankfort-Nachmias & Leon-Guerrero (2011), 451.|
In the scatter diagram, GNP per capita is the independent variable X, which is placed along the X-axis (horizontal axis). Percentage willing to pay more to protect the environment is the dependent variable Y. It is placed along the Y-axis (vertical axis). The table and scatter diagram indicate that there is a tendency for countries with lower GNP per capita to be lower in willingness to pay more to protect the environment. There is a trend for countries with a higher GNP per capita to be willing to pay more. This indicates a positive relationship. There are exceptions, like Chile and Japan, for example.
We can examine the relationship between GNP per capita and a different dependent variable, “percentage that view nature and environment as sacred.” There is a negative relationship between these two variables. Countries with higher GNP seem less likely to view nature as sacred, and the opposite is true for countries with lower GNP.
Linear Relations and Prediction Rules
Scatter diagrams are an excellent first step in examining relationships between two interval-ratio variables, but they are only a first step. We can go on to determine if the relationships between the two variables are linear relationships.
|Linear Relationship||A relationship between two interval-ratio variables in which the observations displayed in a scatter diagram can be approximated with a straight line.|
|Deterministic (perfect) Linear Relationship||A relationship between two interval-ratio variables in which all the observations (the dots) fall along a straight line. The line provides a predicted value of Y (the vertical axis) for any value of X (the horizontal axis).|
We can use a scatter diagram to illustrate that there are linear relationships between the variables, but we will often see that these are not deterministic linear relationships. The figure below shows how the scatter diagram that you constructed on the previous page looks when we superimpose a straight line on the data.
We can use the line to predict the percentage willing to pay higher prices for any value of GNP. Using this straight-line graph, we can see that the predicted value of the percentage willing to pay higher prices at a GNP of $20,000 is 40.
Constructing Straight-Line Graphs
Looking at the straight-line graph on the previous page, we can see that none of the nations studied lies exactly on the line (though the U.S. is pretty close). This prompts us to ask several key questions:
These are fundamental questions for regression analysis. To understand these questions we need to consider how to construct straight-line graphs.
Because the relationships we study are never perfect, or deterministic, we do not expect to find our data points falling along one line. We need to find the best line to use to summarize our data and allow us to predict values of the dependent variable from the values of the independent variable. We need to choose the best-fitting line. We define the best-fitting line as the line that produces the least error, or minimizes error.
To understand what we mean by minimizing error, read pages 422-424 in your textbook. Now let’s look again at the original scatterplot for GNP per capita and percentage willing to pay to protect the environment, and walk through the process to find the best-fitting line.
|metcj702_W04S01T05_bestfit is displayed here|
The best-fitting line, the line that minimizes the sum of the squared errors, is called the least squares line. It is also called the ordinary least squares (OLS) line. The technique for producing the line is called the least squares or ordinary least squares method.
Using this method, we derive values of a and b for the linear regression equation that will produce the smallest amount of error. That equation will minimize Σe².
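To make the idea of minimizing Σe² concrete, here is a minimal Python sketch. The five (X, Y) pairs are made-up illustration values, not the textbook data, and the function name `sum_squared_errors` is our own. The sketch compares the least squares line for these points (a = 2.2, b = 0.6) with two nearby candidate lines and reports which one produces the smallest sum of squared errors.

```python
# Hypothetical data for illustration only (not the GNP data from the textbook).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_errors(a, b, xs, ys):
    """Return the sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# (a, b) pairs: the least squares line for these points plus two alternatives.
candidates = {
    "least squares line": (2.2, 0.6),
    "lower intercept": (2.0, 0.6),
    "steeper slope": (2.2, 0.7),
}

errors = {name: sum_squared_errors(a, b, xs, ys) for name, (a, b) in candidates.items()}
for name, sse in errors.items():
    print(f"{name}: sum of squared errors = {sse:.2f}")

# The candidate with the smallest sum of squared errors.
best = min(errors, key=errors.get)
print("Smallest error:", best)
```

Any other intercept or slope you try for these five points should give a larger sum of squared errors than the least squares line.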
Let’s review to this point:
You should memorize and be sure you understand these essential concepts.
Computing a and b
We can now compute a and b for the prediction equation. Although the material on pages 424-425 in the textbook may look daunting, it is explained thoroughly. First we need the formulas for b, the slope, and a, the Y intercept.
|Estimating the Slope: b|
|The bivariate regression coefficient or the slope of the regression line can be obtained from the observed X and Y scores.|
These formulas require that we have the values of the covariance and the variance of X. You learned about the variance in the lesson on measures of variability. The covariance measures how X and Y vary together. “A Closer Look 13.1” in the textbook (p. 429) provides a clear explanation of the covariance.
|Covariance and Variance|
|Covariance of X and Y – A measure of how X and Y vary together. Covariance will be close to zero when X and Y are unrelated. It will be greater than zero when the relationship is positive and less than zero when the relationship is negative.|
|Variance of X – We have talked a lot about variance in the dependent variable. This is simply the variance for the independent variable.|
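The calculation of b and a can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard formulas (b is the covariance of X and Y divided by the variance of X, and a is the mean of Y minus b times the mean of X). The data values and helper names are hypothetical, not taken from the textbook.

```python
# Hypothetical X and Y scores for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: sum of (X - mean X)(Y - mean Y), divided by N - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    """Sample variance of X: sum of squared deviations, divided by N - 1."""
    mx = mean(xs)
    return sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)

b = covariance(xs, ys) / variance(xs)   # slope
a = mean(ys) - b * mean(xs)             # Y-intercept

print(f"covariance = {covariance(xs, ys):.2f}, variance of X = {variance(xs):.2f}")
print(f"b = {b:.2f}, a = {a:.2f}")
```

Because the covariance here is positive, the slope is positive as well. A negative covariance would produce a negative slope.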
Let’s go through the process for calculating a and b using the data about GNP and national willingness to pay more to protect the environment that we used to look at scatter diagrams earlier in this lesson.
|metcj702_W04S01T05a_ab is displayed here|
You can now use the prediction equation to find the predicted value of the dependent variable, willingness to pay for environmental protection, for a country with any value on the independent variable, GNP per capita. Once we have used the equation to plot two predicted points on a scatter diagram, we can draw the best-fitting line through them, as shown in your text on page 428.
Interpreting a and b
The b coefficient, which is the slope, is 0.63. This means that, in this data set, for every additional $1,000 in GNP per capita, the percentage of citizens of a country who would pay higher prices for environmental protection is predicted to increase by 0.63 percentage points. Keeping in mind that the relationships between variables in social science are inexact, the regression equation gives us a tool by which to make the best possible guess about how a country’s GNP per capita is associated, on average, with the willingness of its citizens to pay higher prices for environmental protection. The slope is the estimate of this relationship.
The Y intercept a is the predicted value of Y when X = 0. Often in a data set there will be no case that has a value of 0. In our example, Latvia has the lowest GNP per capita at 2.42 (GNP per capita is recorded in thousands of dollars, so about $2,420). When none of the cases has a value of 0 on the independent variable, you need to be cautious in interpreting a.
Let’s summarize what you have learned about the bivariate regression line.
|Properties of the Regression Line|
|Represents the predicted values for Y for any and all values of X.|
|Always goes through the point corresponding to the mean of both X and Y.|
|It is the best-fitting line in that it minimizes the sum of the squared deviations.|
|Has a slope that can be positive or negative.|
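Two of these properties are easy to check numerically. The sketch below uses hypothetical data with a negative relationship, fits the least squares line with the standard slope and intercept formulas, and then confirms that the line passes through the point (mean of X, mean of Y). It also shows that the prediction errors sum to zero, a standard property of least squares lines.

```python
# Hypothetical data with a negative relationship, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [9, 7, 6, 4, 4]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least squares slope and intercept (standard formulas).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Predicted Y at the mean of X, and the errors (observed minus predicted) for each case.
predicted_at_mean_x = a + b * mean_x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"slope b = {b:.2f}")
print(f"predicted Y at mean of X = {predicted_at_mean_x:.2f}, mean of Y = {mean_y:.2f}")
print(f"sum of errors is effectively zero: {abs(sum(residuals)) < 1e-9}")
```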
Keep this summary in mind as we delve further into bivariate regression and correlation.
The “Statistics in Practice” section on pages 429-432 in the textbook walks you through the bivariate linear regression of median household income and criminal behavior. Go through that section carefully, but I want to take you through an additional example to tie together the various pieces of regression analysis.
Let’s consider the possibility of a relationship between education, as an independent variable X, and occupational prestige, as a dependent variable Y.
|metcj702_W04S01T05b_interpret is displayed here|
Remember that we use regression analysis to predict values of the Y variable. Reflect on what you know about PRE measures and think about age and skateboards for a moment! What do age and skateboards have to do with education and occupational prestige? Read on and you’ll see!
In bivariate linear regression analysis, we try to find the best-fitting line to depict the relationship between two variables. You already know the formula for a straight line.
And you already know the bivariate linear regression equation.
|Bivariate Linear Regression Equation|
|Y-intercept (a)||The point where the regression line crosses the Y-axis, or the value of Y, when X=0|
|Slope (b)||The change in variable Y (the dependent variable) with a unit change in X (the independent variable).|
|The estimates of a and b have the property that the sum of the squared differences between the observed and predicted values, Σ(Y − Ŷ)², is minimized using ordinary least squares (OLS). Thus, the regression line represents the Best Linear Unbiased Estimators (BLUE) of the intercept and slope.|
Remember that our goal is to choose the best-fitting line to minimize the errors, the distances of each case from the regression line. More technically, the estimates of a and b will minimize the sum of the squared differences between the observed and predicted values. Thus, the regression line represents the best linear unbiased estimators (BLUE).
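For reference, the regression equation and the least squares criterion described above can be written compactly as follows. This is a sketch in standard notation, where Ŷ is the predicted value of Y and e is the prediction error for a single case.

```latex
\begin{align*}
\hat{Y} &= a + bX \\
e &= Y - \hat{Y} \\
\text{OLS chooses } a \text{ and } b \text{ to minimize } \sum e^{2} &= \sum \left(Y - \hat{Y}\right)^{2}
\end{align*}
```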
Statistical Analysis Programs
We will not be using SPSS in this course, but look at the figure below for the type of output produced by SPSS.
|metcj702_W04S01T05b_spss is displayed here|
Different statistical analysis programs may use slightly different terms, but it is generally clear where you can find a and b. Now that we have the regression equation, we can interpret it.
|Interpreting the Regression Equation|
|If a respondent had zero years of schooling, this model predicts that his occupational prestige score would be 6.120 points.|
|For each additional year of education, our model predicts a 2.762 point increase in occupational prestige.|
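The interpretation above can be turned into a small prediction function. This sketch simply plugs the reported values (a = 6.120, b = 2.762) into the bivariate regression equation. The function name and the example values for years of education are our own choices.

```python
def predicted_prestige(years_of_education):
    """Predicted occupational prestige: Y-hat = 6.120 + 2.762 * X."""
    a = 6.120   # Y-intercept: predicted prestige at zero years of schooling
    b = 2.762   # slope: predicted change in prestige per additional year of education
    return a + b * years_of_education

# Example values chosen for illustration.
for years in (0, 12, 16):
    print(f"{years} years of education: predicted prestige {predicted_prestige(years):.3f}")
```

Notice that moving from 12 to 16 years of education raises the predicted score by four times the slope, because each additional year adds the same predicted amount.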
Assessing the Accuracy of Predictions: The Coefficient of Determination
Figure 13.6 on page 428 in the textbook displays the regression line for median household income (X) and Percentage of State’s Residents with a Bachelor’s degree (Y). Figure 13.8 on page 435 shows the predicted value of Y for New York. The authors ask us to consider the following situation.
Suppose we didn’t know the actual Y, the percentage of residents of New York who have a bachelor’s degree. Suppose further that we did not have knowledge of X, New York’s median household income (Frankfort-Nachmias and Leon-Guerrero 2015, p. 434).
What would we do in this situation? Do you have any ideas? You would use the mean value as your estimate. That would be 28.08%. When we know median household income, we can use the regression equation to estimate New York’s percentage of residents with a bachelor’s degree. Using the equation on page 435, we would predict 29.27% of New York residents have bachelor’s degrees.
We have improved the prediction by 1.19 percentage points. This is obviously an improvement, but we really want to see how much we improve the predictions for all cases.
Because we already have the total sum of squares, we can create the regression sum of squares, or SSR, as indicated in the following slide.
Now we can put all this together and consider the coefficient of determination (r²), which calculates the two measures of error for all cases in a particular research problem and gives us an overall idea of how much we reduce error by using the linear model. It is a PRE measure.
|Coefficient of Determination (r²)|
|A PRE measure reflecting the proportional reduction of error that results from using the linear regression model.|
|The total sum of squares (SST) measures the prediction error when the independent variable is ignored.|
|The error sum of squares (SSE) measures the prediction errors when using the independent variable and the linear regression equation.|
The coefficient of determination in this case is 0.83. This indicates that in this example we reduce our prediction error by 83 percent. It also shows that the independent variable, median household income, explains 83 percent of the variation in the dependent variable, percentage with a bachelor’s degree. The pie chart, Figure 13.9 on page 439 in the textbook, illustrates one way to think about r². Because the independent variable explains about 83 percent of the total variation in the dependent variable, that leaves about 17 percent of the variation to be explained by other factors.
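The PRE logic behind r² can be sketched in Python. The data below are hypothetical (they are not the state-level data from the textbook, so the result is not 0.83). The sketch computes the error made when we ignore X (SST), the error made when we use the regression line (SSE), and the proportional reduction in error, assuming the standard relationship r² = (SST − SSE) / SST.

```python
# Hypothetical data for illustration only (not the state-level textbook data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# SST: prediction error when we ignore X and guess the mean of Y for every case.
sst = sum((y - mean_y) ** 2 for y in ys)
# SSE: prediction error when we use the regression line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# SSR: the part of the total error removed by using the regression line.
ssr = sst - sse

r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"r-squared = {r_squared:.2f}")
```

For these made-up points, the regression line removes 60 percent of the prediction error, leaving 40 percent to be explained by other factors.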
You can also calculate r² with this simpler formula.
Upon completing this lesson you will be able to:
Bivariate Correlation: Pearson’s Correlation Coefficient (r)
The coefficient of determination (r²) is obviously useful. The square root of r², Pearson’s correlation coefficient (r), is the measure of association most often used for measuring the strength and direction of the relationships between two interval-ratio variables.
|The Correlation Coefficient|
|Pearson’s Correlation Coefficient (r): The square root of r². It is a measure of association between two interval-ratio variables.|
|Symmetrical Measure: no specification of independent or dependent variables.|
|Ranges from -1.0 to +1.0. The sign (±) indicates direction. The closer the number is to ±1.0, the stronger the association between X and Y.|
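As a rough illustration of these properties, the sketch below computes Pearson’s r directly from hypothetical data, assuming the standard formula (the covariance of X and Y divided by the product of their standard deviations). Note that taking the square root of r² alone would lose the sign; r takes the same sign as the slope b. The function name `pearson_r` and the data are our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r, computed from sums of deviations (the N - 1 terms cancel out)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]    # Y tends to increase with X
falling = [9, 7, 6, 4, 4]   # Y tends to decrease with X

r_pos = pearson_r(xs, rising)
r_neg = pearson_r(xs, falling)
print(f"rising:  r = {r_pos:+.3f}, r-squared = {r_pos ** 2:.3f}")
print(f"falling: r = {r_neg:+.3f}, r-squared = {r_neg ** 2:.3f}")

# Symmetrical measure: swapping X and Y gives the same r.
print("symmetrical:", math.isclose(pearson_r(rising, xs), r_pos))
```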
The following scatter diagrams indicate what the relationships would look like for r values of 0.00, +1.00, and -1.00.
Figure 13.10 on page 440 in the textbook demonstrates what the scatter diagrams would look like for specific values of r.
The study of bivariate correlation concludes with two “Statistics in Practice” sections. The first examines the relationship between social inequality and teen pregnancy, using unemployment rates as an indicator of inequality. The second section investigates the relationship between education and annual income of married and single women. Read through both examples to give you practice in regression analysis. Note that the last paragraph of each section starts with a similar sentence:
The authors point out that a complete analysis of teenage pregnancy would require consideration of other socioeconomic indicators, such as poverty rates, welfare policies, and expenditures on education. A more complete analysis of factors affecting earnings would include occupation, seniority, race/ethnicity, and age. The next statistical technique we will examine—multiple regression analysis—allows us to take additional independent variables into account.
Multiple Regression Analysis
|We examine the effect of two or more independent variables on the dependent variable.|
|The calculations are easily accomplished using SPSS or other statistical software.|
|The general form of the multiple regression equation (two independent variables):|
Multiple regression is an extension of bivariate linear regression. The multiple regression equation is an extended version of the bivariate linear regression equation. It has only one Y-intercept (a), but has as many values for slopes (b) as there are independent variables.
|Ŷ||predicted score on the dependent variable.|
|X1||score on independent variable X1.|
|X2||score on independent variable X2.|
|a||Y-intercept or the value of Y when both X1 and X2 are equal to zero.|
|b1||change in Y with a unit change in X1 when X2 is controlled.|
|b2||change in Y with a unit change in X2 when X1 is controlled.|
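Using the symbols defined in the table above, the general form of the multiple regression equation with two independent variables can be written in standard notation as:

```latex
\hat{Y} = a + b_{1}X_{1} + b_{2}X_{2}
```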
Multiple regression enables us to answer numerous social science and criminal justice research questions when we need to separate out the effects of multiple independent variables and to assess their combined impact. Multiple regression can help in the following examples:
We can also use multiple regression to extend our analysis of teen pregnancy rates. In the section on bivariate analysis, we used the independent variable unemployment rate as an indicator of social inequality, but we noted that for a complete analysis of teen pregnancy rates we would need to incorporate additional independent variables into the analysis. One such variable is expenditures on education.
Now we can use multiple regression analysis to extend our original bivariate analysis by adding education expenditures as an additional independent variable. Note the hypothesis and the regression equation for this example.
|The hypothesis: The higher the state’s expenditure, the lower the teen pregnancy rate. The higher the state’s unemployment rate, the higher the teen pregnancy rate.|
|The multiple linear equation is:|
|Ŷ||= teen pregnancy|
|X1||= unemployment rate|
|X2||= expenditure per pupil|
The interpretation of the multiple regression equation is an extension of the way we interpreted the bivariate regression equation.
|A state’s teen pregnancy rate (Ŷ) goes up by 9.736 for each 1% increase in the unemployment rate (X1), holding expenditure per pupil (X2) constant.|
|A state’s pregnancy rate goes down by .007 with each $1 increase in the state’s expenditure per pupil (X2) holding the unemployment rate (X1) constant.|
|The value of a (49.813) reflects the state’s teen pregnancy rate when both the unemployment rate and the state’s expenditure per pupil are equal to zero.|
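The phrase “holding the other variable constant” can be illustrated with a short Python sketch. It uses the coefficients reported above (a = 49.813, b1 = 9.736, b2 = −0.007). The function name and the input values for the three example states are hypothetical and chosen only for illustration.

```python
def predicted_teen_pregnancy_rate(unemployment_rate, expenditure_per_pupil):
    """Y-hat = 49.813 + 9.736 * X1 - 0.007 * X2, using the lesson's coefficients."""
    a = 49.813    # predicted rate when both independent variables equal zero
    b1 = 9.736    # change per 1-point rise in unemployment, expenditure held constant
    b2 = -0.007   # change per $1 rise in expenditure per pupil, unemployment held constant
    return a + b1 * unemployment_rate + b2 * expenditure_per_pupil

# Hypothetical states: change one independent variable at a time.
base = predicted_teen_pregnancy_rate(5.0, 6000)
more_unemployment = predicted_teen_pregnancy_rate(6.0, 6000)
more_spending = predicted_teen_pregnancy_rate(5.0, 7000)

print(f"base prediction: {base:.3f}")
print(f"one more point of unemployment: change of {more_unemployment - base:+.3f}")
print(f"$1,000 more per pupil: change of {more_spending - base:+.3f}")
```

Changing only the unemployment rate moves the prediction by b1, and changing only expenditure per pupil moves it by b2 for each dollar, which is what the two interpretation statements describe.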
Like bivariate regression, multiple regression has a coefficient of determination, R².
|Multiple Regression and Coefficient of Determination|
|The coefficient of determination for multiple regression is R².|
|Measures the proportional reduction of error that results from using the linear regression model.|
|We obtained an R² of .267. This means that by using the states’ unemployment rates and expenditures per pupil to predict pregnancy rates, the error of prediction is reduced by 26.7% (.267 x 100).|
Just as there is a bivariate correlation coefficient, r, there is also a multiple correlation coefficient, R.
|Multiple Correlation Coefficient|
|The square root of R² (R) is the multiple correlation coefficient.|
|Measures the linear relationship between the dependent variable and the combined effect of two or more independent variables.|
|For example, R = .52. It indicates that there is a moderate relationship between teen pregnancy rate and both unemployment rate and expenditure per pupil.|
Chapter 13 presented a great deal of information. It introduced you to some of the most important ideas and statistical techniques used in social science, including criminal justice. If you ever do carry out your own research, or if you need to read and understand contemporary criminal justice research, you will need to understand multiple regression beyond this brief introduction. A second graduate criminal justice statistics course would emphasize multiple regression and other multivariate techniques. Should you want or need a more complete, yet still understandable introduction to multiple regression, you might read Paul Allison’s book Multiple Regression: A Primer (1999).
Module 6 Data Set
Regression and Correlation
Remember that the basic idea behind measures of association is to describe whether knowing a score on an independent variable helps us to predict scores on a dependent variable. Proportional Reduction in Error (PRE) statistics assess the percentage or proportional improvement in prediction when we consider the value of an independent variable. When both the independent and dependent variables are measured at the interval-ratio level, we can use regression and correlation.
Regression and correlation can be used in both exploratory data analysis and in testing specific hypotheses. The measure r² is a PRE measure of association.
Our text does not explain testing r for significance. However, in exploratory data analysis, this can help us to decide which relationships to investigate further. (SPSS computes significance tests for r.)
|Correlations From SPSS|
At the same time that I computed these correlations, I’d also look at the scattergrams to see if any of the relationships between variables seemed to be curvilinear rather than straight line. The correlation coefficients in bold are the statistically significant ones (p < .05). Let’s look at each of these. What do these correlations suggest about the relationships between variables?
Note that I didn’t talk about causation; only association. Since these data do not come from an experimental design, we can’t clearly separate out causes. For example, age doesn’t cause a higher number of arrests, but it is an important variable to examine. The longer someone has been alive, the greater the opportunity for criminal justice system contact, and since many criminals age out of crime, those older sentenced defendants are more likely to be career criminals.
In exploring the data, we would find that all those identified as gang members were initially classified to maximum security (a matter of policy). Gang members tend to be younger than other sentenced inmates in the sample (r = -.382). Perhaps this explains the negative association between age and classification to max (r = -.311). We might want to analyze non-gang sentenced defendants as a separate group.
We can examine two of our hypotheses from Module four using the original interval-ratio data and regression and correlation.
The correlation coefficient is .351. This indicates a moderate positive relationship between the variables. As PCL-R scores increase (indicating higher tendencies toward psychopathy), number of arrests increases. This result is statistically significant (our text does not describe statistical tests of significance for correlations until Chapter 14). We would reject the null hypothesis.
Our linear equation would be:
Predicted Arrests = 3.27 +.089(PCL-R).
For example, we would predict that a person with a PCL-R Score of 10 would have a predicted record of 4.16 arrests. A person with a score of 30 would have a predicted record of 5.94 arrests.
r² = .123. This means that considering PCL-R scores explains 12.3% of the variation in number of arrests.
The correlation coefficient is -0.44. This indicates a moderate negative relationship between the variables. As Self Control (GRAS) scores increase (indicating higher self control), number of arrests decreases. This result is statistically significant (though our text does not describe statistical tests of significance for correlations). We would reject the null hypothesis.
Our linear equation would be:
Predicted Arrests = 7.68 – 0.065(SC).
For example, we would predict that a person with a Self Control Score of 36 would have a predicted record of 5.34 arrests. A person with a score of 72 would have a predicted record of 3.0 arrests.
r² = .19. This means that considering Self Control scores explains 19% of the variation in number of arrests.
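The two bivariate prediction equations and their r² values can be checked with a short Python sketch. The coefficients and correlations come from the results reported above; the function names are our own.

```python
def predicted_arrests_from_pclr(pclr_score):
    """Predicted Arrests = 3.27 + 0.089(PCL-R)."""
    return 3.27 + 0.089 * pclr_score

def predicted_arrests_from_self_control(sc_score):
    """Predicted Arrests = 7.68 - 0.065(SC)."""
    return 7.68 - 0.065 * sc_score

for score in (10, 30):
    print(f"PCL-R = {score}: predicted arrests = {predicted_arrests_from_pclr(score):.2f}")

for score in (36, 72):
    print(f"Self Control = {score}: predicted arrests = {predicted_arrests_from_self_control(score):.2f}")

# r-squared is the square of each correlation coefficient.
r_pclr = 0.351
r_self_control = -0.44
print(f"PCL-R: r-squared = {r_pclr ** 2:.3f}")
print(f"Self Control: r-squared = {r_self_control ** 2:.2f}")
```

Squaring removes the sign, so the negative correlation for Self Control still yields a positive proportion of explained variation.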
We can use multiple regression to explore these relationships further. For example, we saw earlier, in our cross-tabulations, age is related to number of arrests (+.0486); self control score is related to arrests (-0.440); and psychopathy (PCL-R score) is also related to arrests (+0.351). Remember that high Self Control scores indicate high self control, and it makes sense that inmates with higher self control would have fewer arrests. The correlations above between age and Self Control (-.045) and age and PCL-R (-0.016) are not significant. The correlation between PCL-R and Self Control scores is moderate (-0.645). In a multiple regression analysis, we could consider whether, after the effects of age are considered, the relationship between arrests and self control or arrests and PCL-R changes.
Using SPSS, the equation for the predicted number of Arrests based on knowledge of both Age AND Self control score AND PCL is:
Predicted Arrests = 2.288 + 0.118(Age) –0.063(Self Control) + 0.038 (PCL)
The value of the coefficient b for each variable is statistically significant. That is, when age is controlled, self control score is still a predictor of arrests. The positive sign for Age indicates that arrests increase with age. However, the negative sign associated with Self Control score indicates that lower scores (i.e., lower self control) are associated with higher numbers of arrests even after controlling for Age and PCL. The positive sign associated with PCL indicates that higher numbers of arrests are associated with higher PCL, even controlling for age and Self Control.
Check your understanding of the concept of multiple regression with these calculations below.
|If age=20||Self Control = 45||PCL-R = 18|
|If age=30||Self Control = 45||PCL-R = 18|
|If age=40||Self Control = 45||PCL-R = 18|
The first column shows predicted numbers of arrests for persons ages 20, 30 and 40 who have the same Self Control and PCL scores.
Could you calculate predicted number of arrests for 20 year olds with Self control scores of 25 and 65?
Answer: 3.757 arrests predicted for a 20 year old with a Self Control score of 25 and a PCL score of 18; 1.237 arrests for a 20 year old with a Self Control score of 65 and a PCL score of 18. As Self Control scores increase (greater self control) the number of predicted arrests decreases, when age and PCL scores are controlled.
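These calculations can be reproduced with a short Python sketch that plugs values into the multiple regression equation given above. The function name is our own; the coefficients and scores come from this section.

```python
def predicted_arrests(age, self_control, pcl):
    """Predicted Arrests = 2.288 + 0.118(Age) - 0.063(Self Control) + 0.038(PCL)."""
    return 2.288 + 0.118 * age - 0.063 * self_control + 0.038 * pcl

# Same Self Control (45) and PCL (18) scores, different ages.
by_age = {age: predicted_arrests(age, 45, 18) for age in (20, 30, 40)}
for age, arrests in by_age.items():
    print(f"age {age}, Self Control 45, PCL 18: {arrests:.3f} predicted arrests")

# Same age (20) and PCL score (18), different Self Control scores.
by_self_control = {sc: predicted_arrests(20, sc, 18) for sc in (25, 65)}
for sc, arrests in by_self_control.items():
    print(f"age 20, Self Control {sc}, PCL 18: {arrests:.3f} predicted arrests")
```

Each additional ten years of age adds 1.18 predicted arrests (10 × 0.118) when the Self Control and PCL scores are held constant.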