thespacebetweenstars.com

Understanding Simple Linear Regression in R: A Comprehensive Guide


Chapter 1: Introduction to Simple Linear Regression

Simple linear regression is a statistical approach that helps to describe the connection between a dependent variable (outcome) and an independent variable (predictor). This method is extensively utilized in predictive modeling, and R, equipped with powerful statistical tools, offers a user-friendly interface for conducting this analysis.

In this model, the relationship between the dependent variable ( Y ) and the independent variable ( X ) is represented through a linear equation:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

In this equation, \( \beta_0 \) signifies the intercept, which is the anticipated value of \( Y \) when \( X \) is zero; \( \beta_1 \) denotes the slope, indicating the change in \( Y \) with a one-unit change in \( X \). A negative slope suggests that as \( X \) increases, \( Y \) tends to decrease, while \( \epsilon \) represents the error term. The aim is to estimate the parameters \( \beta_0 \) and \( \beta_1 \) that best represent the data.
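For reference, the least-squares estimates of these parameters have closed forms (standard results for simple linear regression, independent of any particular dataset):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

This is what R's `lm` function computes under the hood when there is a single predictor.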

Section 1.1: Conducting Simple Linear Regression in R

Let's delve into a practical example using R. We will examine whether height (in centimeters) can be forecasted based on age (in months).

Download the data here.

Load the Data:

library(readxl)

age_and_height <- read_excel("ageandheight.xls") # Load the data

View(age_and_height) # Display the data (note the capital V in View)

Exploring the Data

Before executing the regression, it's beneficial to visualize the data to understand the relationship between the variables and to identify any outliers.

Creating Boxplots to Identify Outliers:

par(mfrow=c(1, 2)) # Split the graph area into 2 columns

boxplot(age_and_height$age, main="Age", sub=paste("Outlier rows: ", boxplot.stats(age_and_height$age)$out)) # Boxplot for age

boxplot(age_and_height$height, main="Height", sub=paste("Outlier rows: ", boxplot.stats(age_and_height$height)$out)) # Boxplot for height

The boxplots for age and height indicate that there are no outliers in the dataset.

Creating a Scatterplot to Observe Linear Trends:

# Scatterplot

plot(age_and_height$height ~ age_and_height$age, xlab="Age", ylab="Height", pch=16, col="red")

The scatterplot illustrates a discernible linear relationship between age and height.
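As a quick numeric complement to the scatterplot, the Pearson correlation quantifies the strength of a linear association. A minimal sketch, using hypothetical age/height values in place of the ageandheight.xls data:

```r
# Hypothetical values standing in for the ageandheight.xls data
age    <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
height <- c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)

cor(age, height) # A value near 1 indicates a strong positive linear association
```

A correlation close to +1 or -1 supports fitting a straight line; values near 0 suggest a linear model will explain little.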

Section 1.2: Fitting the Linear Regression Model

Next, we will utilize the lm function to fit a simple linear regression model where height is the dependent variable and age is the independent variable.

# Fit the model

lm_height <- lm(height ~ age, data=age_and_height) # Create the linear regression model

summary(lm_height) # Review the results

To visualize the regression line on the plot:

# Plot with regression line

library(ggplot2)

ggplot(age_and_height, aes(x=age, y=height)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE, color="blue") +
  labs(title="Linear Regression of Height on Age", x="Age", y="Height")

Assessing Model Fit

Before interpreting the results and utilizing the model for predictions, we should verify a crucial assumption of linear regression: that the residuals follow a normal distribution. This can be assessed visually with a QQ plot or formally with a test such as the Shapiro-Wilk test.

plot(lm_height$residuals, pch=16, col="red") # Residuals against observation index

qqnorm(lm_height$residuals, pch=16, col="red") # QQ plot of the residuals

qqline(lm_height$residuals, col="red") # Reference line (pch does not apply to a line)

shapiro.test(lm_height$residuals) # Formal test of normality

The points align closely with the straight line, except for two, indicating a satisfactory fit. The p-value from the Shapiro-Wilk test is 0.3; since this exceeds 0.05, we fail to reject the null hypothesis, so there is no evidence that the residuals depart from normality.
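If this check should be automated rather than read off the console, the p-value can be extracted directly from the test object. A minimal sketch with simulated data (the variable names below are illustrative, not from the original analysis):

```r
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30) # Linear trend with normal noise
fit <- lm(y ~ x)

# shapiro.test returns an htest object; its p.value element is a plain number
p_value <- shapiro.test(fit$residuals)$p.value
p_value > 0.05 # A value above 0.05 gives no evidence against normality
```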

Making Predictions

We can leverage our fitted model to make predictions. For instance, let's predict the height of a child aged 20.5 months. The summary output gives the intercept \( \beta_0 = 64.92 \) and the slope for age \( \beta_1 = 0.635 \), leading to the predicted height in centimeters as follows:

\[ 64.92 + (0.635 \times 20.5) = 77.94 \ \text{cm} \]

# Extract fitted values

lm_height$fitted.values

plot(age_and_height$height ~ lm_height$fitted.values, xlab="Predicted values", ylab="Observed values", pch=16, col="red")

# Predict height for a 20.5-month-old child

new_data <- data.frame(age=20.5)

predict(lm_height, new_data) # 77.94 cm
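A point prediction can be supplemented with interval estimates via predict()'s interval argument. A sketch with simulated data mimicking the example (the coefficients used to generate the data are assumptions, not the fitted values from ageandheight.xls):

```r
set.seed(42)
age <- 18:29
height <- 65 + 0.64 * age + rnorm(12, sd = 0.3) # Simulated heights
fit <- lm(height ~ age)

new_data <- data.frame(age = 20.5)
predict(fit, new_data, interval = "confidence") # Uncertainty in the mean height at age 20.5
predict(fit, new_data, interval = "prediction") # Uncertainty for an individual child's height
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the residual variability of a single new observation.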

Documentation of the Analysis

The analysis indicates that age is a significant predictor of height, with \( F(1, 10) = 880 \), \( p < 0.0001 \). Age accounted for 98.8% of the variability in height (R² = 0.988). The regression equation derived is:

\[ \text{Height} = 64.92 + (0.635 \times \text{age}) \]
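Rather than transcribing these figures into a report by hand, they can be pulled from the model object, which avoids copy-paste errors. A sketch using hypothetical data in place of ageandheight.xls:

```r
age <- 18:29
height <- c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5) # Hypothetical

fit <- lm(height ~ age)
s <- summary(fit)

s$r.squared                        # R-squared
s$fstatistic                       # F value plus numerator and denominator df
coef(fit)                          # Intercept and slope estimates
s$coefficients["age", "Pr(>|t|)"]  # p-value for the slope
```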

Chapter 2: Additional Resources

The first video provides a clear explanation of simple linear regression in R, guiding viewers through the concepts and code step-by-step.

The second video offers a detailed, step-by-step tutorial on performing linear regression in R, making it ideal for beginners and those looking to enhance their understanding.

