The purpose of this project is to develop a model for the Sale Price of a home in Ames, Iowa based on the other variables in the data set

NavarroAlexKU/Predicting-Housing-Price

Using Linear Regression To Predict Housing Price

The purpose of this project is to develop a model for the Sale Price of a home in Ames, Iowa, based on the other variables in the data set. We will use this model to help us predict housing prices.

ScreenShot

Authors

🔗 Social Media Links

linkedin

Documentation

You can download the dataset used in the analysis from the CRAN website.

Data

Project Topics:

In the analysis, we will touch on concepts such as exploratory data analysis, data preprocessing, model selection, and model diagnostics.

Installation & Packages:

App Screenshot

The analysis was done in R. You will need the following packages to run the code:

1.) MASS

2.) ggplot2

3.) Sleuth2

install.packages("MASS")

install.packages("ggplot2")

install.packages("Sleuth2")

Exploratory Data Analysis:

There are many variables in this data set. One thing I always like to do is look at the structure and summary of the dataset. Doing this allows me to see how many missing (NA) values are in the data set and which data types I will be working with.

# Execute Summary and Structure of Data:
summary(data)
str(data)

App Screenshot

App Screenshot
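To get the missing-value counts and column types as plain numbers rather than reading them off the summary output, something like the following could be used. This is only a sketch: the `toy` data frame is a made-up stand-in for the Ames data (its values are invented), and the column names are borrowed from this README for illustration.

```r
# Hypothetical stand-in for the Ames data (values are made up for illustration)
toy <- data.frame(
  SalePrice    = c(215000, 105000, NA, 244000),
  Gr.Liv.Area  = c(1656, 896, 1329, NA),
  Neighborhood = c("NAmes", "NAmes", NA, "Gilbert"),
  stringsAsFactors = TRUE
)

# Number of missing (NA) values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Class of each column, and how many columns share each class
col_types <- sapply(toy, class)
print(table(col_types))
```

Replacing `toy` with the real data frame gives one count per column, which makes it easy to spot variables that are mostly missing.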

It's good practice to plot each of our independent variables against our dependent variable SalePrice so we can see whether the two are correlated. This can also help us eliminate variables right away if we see no correlation with SalePrice.
App Screenshot App Screenshot App Screenshot
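As a rough sketch of this screening step, the correlations can be computed for all numeric predictors at once and one predictor plotted against SalePrice. The data below is simulated (it is not the Ames data), the `Unrelated` column is a hypothetical noise variable, and the plot is only drawn if ggplot2 is installed.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible

# Simulated stand-in data: two informative predictors and one pure-noise column
toy <- data.frame(
  Gr.Liv.Area  = runif(100, 500, 3000),
  Overall.Qual = sample(1:10, 100, replace = TRUE)
)
toy$SalePrice <- 20000 + 80 * toy$Gr.Liv.Area + 10000 * toy$Overall.Qual +
  rnorm(100, sd = 20000)
toy$Unrelated <- rnorm(100)

# Correlation of every predictor with SalePrice, strongest first
predictors <- setdiff(names(toy), "SalePrice")
cors <- sapply(toy[predictors],
               function(x) cor(x, toy$SalePrice, use = "complete.obs"))
print(sort(abs(cors), decreasing = TRUE))

# Scatter plot of one predictor against SalePrice with a fitted straight line
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = Gr.Liv.Area, y = SalePrice)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
  print(p)
}
```

A correlation near zero only rules out a linear relationship, so it is still worth looking at the plot before dropping a variable.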

Modeling:

Train/Test Split:

We want to split our data into train and test sets: for more information on this please refer to Train/Test_Split.

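### Split into training and test sets: 1,800 of 2,258 rows for training (about 80/20)
train <- sample(2258, 1800)      # row indices for the training set
test <- seq_len(2258)[-train]    # the remaining 458 rows form the test set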
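The same split can be written so that it does not hard-code the number of rows and gives the same result on every run. This is a sketch on simulated data; the seed value, the 80% fraction, and the `toy` data frame are assumptions for illustration, not taken from the original analysis.

```r
set.seed(123)  # arbitrary seed, only so the split is reproducible

# Simulated stand-in data
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 2 * toy$x + rnorm(50)

# Draw 80% of the row indices for training; the rest become the test set
n <- nrow(toy)
train <- sample(n, size = floor(0.8 * n))
test <- setdiff(seq_len(n), train)

# Fit on the training rows only, then measure error on the held-out rows
fit <- lm(y ~ x, data = toy[train, ])
pred <- predict(fit, newdata = toy[test, ])
rmse <- sqrt(mean((toy$y[test] - pred)^2))
print(rmse)
```

Evaluating on rows the model never saw gives a more honest error estimate than the residuals of the training fit.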

Modeling Strategy:

There are many different strategies one can utilize when trying to determine the best predictors for our dependent variable SalePrice. You could use:

1.) forward stepwise regression

2.) best subset

3.) backwards elimination

and many more.

For this demonstration, I'll look at the p-value for each coefficient. If a p-value is greater than 0.05, I remove that variable from the model and rerun the model, repeating until only statistically significant variables remain. After this process, my final model with continuous variables only is the following:

App Screenshot
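The p-value procedure described above can be automated with a short loop. The sketch below is one possible way to do it, not the code used for this project: it runs on simulated data, drops the single least significant predictor per pass, and assumes all predictors are numeric so that coefficient names match column names.

```r
set.seed(42)  # arbitrary seed for reproducible simulated data

# Simulated data: y depends on x1 and x2; 'noise' is unrelated to y
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

predictors <- c("x1", "x2", "noise")
repeat {
  fit <- lm(reformulate(predictors, response = "y"), data = toy)
  # p-values of all coefficients except the intercept (kept as a matrix)
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  worst <- which.max(pvals)
  if (pvals[worst] <= 0.05) break        # everything left is significant
  if (length(predictors) == 1) break     # never drop the last predictor
  predictors <- setdiff(predictors, rownames(pvals)[worst])
}
print(predictors)
print(summary(fit)$coefficients)
```

Removing one variable at a time matters because p-values change whenever the set of predictors changes.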

Model Check Diagnostics:

Diagnostics we can look at include the residuals vs fitted plot, a normality plot of the residuals, and the Shapiro-Wilk test. In this project I will not break down each of these diagnostics; a future project will cover the topic in more depth.

For now, I will say that we want the variance of the residuals to be constant across the residuals vs fitted plot. Here we can see that the variance is not constant. One way to try to fix this is to use the Box-Cox method to transform our dependent variable.
App Screenshot App Screenshot App Screenshot
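For readers who want to reproduce these checks, a minimal sketch using base R is shown here. It uses simulated data with a right-skewed response (so the problems are visible on purpose); the variable names and seed are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed for reproducible simulated data

# Simulated data whose spread grows with x, so a plain linear fit misbehaves
x <- runif(200, 1, 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.4))
fit <- lm(y ~ x)

# Residuals vs fitted: a funnel shape indicates non-constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line indicate non-normal residuals
qqnorm(resid(fit))
qqline(resid(fit))

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
print(sw)
```

Calling `plot(fit)` on an `lm` object produces a similar set of diagnostic plots in one step.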

Box Cox Transformation:

# Run the Box-Cox procedure (from MASS) to find a power transformation for SalePrice:
library(MASS)
boxcox(SalePrice ~ Overall.Qual + Year.Built + Year.Remod.Add + BsmtFin.SF.1 + Total.Bsmt.SF +
         X1st.Flr.SF + Gr.Liv.Area + TotRms.AbvGrd + Garage.Yr.Blt + Wood.Deck.SF, data = num.ames)

The output below shows that the estimated lambda is close to zero, which in the Box-Cox family corresponds to the log transformation. Therefore, we will take the log of our dependent variable SalePrice.
App Screenshot
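The lambda value can also be read off numerically instead of from the plot. The following is a sketch on simulated data, where the response is built to be log-linear so the estimate should land near zero; the `toy` data frame and the seed are assumptions, not part of the original analysis.

```r
library(MASS)  # provides boxcox()

set.seed(11)  # arbitrary seed for reproducible simulated data

# Simulated data: log(y) is linear in x, so lambda should be close to 0
toy <- data.frame(x = runif(300, 0, 4))
toy$y <- exp(1 + 0.5 * toy$x + rnorm(300, sd = 0.3))

# boxcox() returns the lambda grid (x) and the profile log-likelihood (y)
bc <- boxcox(y ~ x, data = toy, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# A lambda near 0 suggests modeling log(y) instead of y
log_fit <- lm(log(y) ~ x, data = toy)
print(coef(log_fit))
```

With the default settings the grid runs from -2 to 2 in steps of 0.1, so `lambda_hat` is the best grid point rather than an exact optimum.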

Fitted vs Residuals After Taking Log Transformation:

App Screenshot

While we can still see clusters of data points in some portions of the plot, the variance of the residuals looks much more constant after taking the log transformation.

Modeling Categorical Variables:

Using the anova function in R, I will add one categorical variable at a time to the numeric-only model until all of the variables remaining in my model are statistically significant.
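One step of this process might look like the sketch below, which compares the numeric-only model against the same model plus one categorical variable. The data is simulated, and the names `area`, `zone`, and `log_price` are hypothetical stand-ins rather than columns from the Ames data.

```r
set.seed(5)  # arbitrary seed for reproducible simulated data

# Simulated data: log price depends on area and on a three-level factor
n <- 150
toy <- data.frame(
  area = runif(n, 500, 3000),
  zone = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
zone_effect <- c(A = 0, B = 0.3, C = -0.3)
toy$log_price <- 11 + 0.0004 * toy$area +
  unname(zone_effect[as.character(toy$zone)]) + rnorm(n, sd = 0.15)

# Numeric-only model, then the same model with the categorical variable added
fit_num <- lm(log_price ~ area, data = toy)
fit_cat <- update(fit_num, . ~ . + zone)

# F-test for the nested models: a small p-value favors keeping 'zone'
cmp <- anova(fit_num, fit_cat)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

The F-test judges the categorical variable as a whole, which is more appropriate than looking at the p-values of its individual dummy coefficients.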

Final Model:

After running the anova function, the following is my final model and predictors: App Screenshot

Final Model Check Diagnostics:

The variance in our residuals vs fitted plot looks consistent in our final model.

App Screenshot

We can see some skewness in the normality plot of the residuals, but overall the model looks good when testing normality.

App Screenshot

App Screenshot

Housing Price Predictions:

The final model gives the following lower and upper bounds for the predicted housing prices: App ScreenShot
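Bounds like these can be obtained with `predict()` and its `interval` argument. The sketch below assumes a model fitted on the log of the price, as in this analysis, and converts the interval back to dollars with `exp()`. The data is simulated, and `area` and `new_homes` are hypothetical names.

```r
set.seed(3)  # arbitrary seed for reproducible simulated data

# Simulated data with a log-linear price relationship
toy <- data.frame(area = runif(200, 500, 3000))
toy$price <- exp(11 + 0.0004 * toy$area + rnorm(200, sd = 0.15))

# Model fitted on the log scale
fit <- lm(log(price) ~ area, data = toy)

# 95% prediction intervals for two hypothetical new homes (log scale)
new_homes <- data.frame(area = c(1000, 2000))
log_pred <- predict(fit, newdata = new_homes,
                    interval = "prediction", level = 0.95)

# Back-transform to the price scale; columns are fit, lwr, upr
price_pred <- exp(log_pred)
print(round(price_pred))
```

After back-transforming, the `fit` column is closer to a median price than a mean price, while `lwr` and `upr` remain valid interval bounds.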
