Building and Evaluating Predictive Models using SAS Enterprise Miner





The principal aim of this project is to estimate the price of houses in King County, Melbourne, based on the various factors associated with them. With the help of such analysis, individuals and organizations can know in advance how much a house with a defined set of specifications and features would cost in an area. In the course of this project, I will investigate the following:

  1. How much does a house in King County, Melbourne, cost with certain features?
  2. What fundamental factors are driving up the cost of a home?

Data Description:

I used a dataset retrieved from a data repository website. This dataset contains house sale prices for Melbourne homes.

Model Diagram:

Setting up the project and exploratory analysis

Data Cleaning:

A few variables, such as the unique ID, latitude, longitude, and date the house was sold, were eliminated at the beginning of the analysis because they do not significantly influence house prices.
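The project itself was carried out in SAS Enterprise Miner, but as a minimal sketch of this cleaning step in Python (the mini-frame and its column names are illustrative stand-ins for the real dataset schema):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the house-sales schema; the values
# are made up and only a few columns are shown.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["20141013", "20141209", "20150225"],
    "lat": [47.51, 47.72, 47.74],
    "long": [-122.26, -122.32, -122.23],
    "price": [221900.0, 538000.0, 180000.0],
    "sqft_living": [1180, 2570, 770],
})

# Drop the unique ID, sale date, and coordinates, which do not
# significantly influence house prices.
df = df.drop(columns=["id", "date", "lat", "long"])
print(list(df.columns))  # ['price', 'sqft_living']
```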

First, I examined house prices by zip code to find insights for the problem being explored.

One of the charts used for this analysis is attached below. I noticed that some zip codes have significantly more house sales than others, with between 60 and 559 observations per zip code. On average, the houses in some zip codes are more expensive and also larger in square footage. Compared to the houses in the rural area, those near certain zip codes are relatively recent. This observation helped in identifying the zip codes to be used in building my regression model.

After examining the houses by location, I next investigated the output variable, house price, and the relationship between it and the other variables.

Analysis of the output variable:

I noticed that there are many outliers at the top of the distribution; a few houses are worth more than $600,000, and the distribution is slightly right-skewed.
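A quick sanity check for right skew (sketched in Python on made-up prices, since the actual exploration was done in SAS Enterprise Miner) is to compare the mean against the median; a mean pulled well above the median signals a long right tail:

```python
import numpy as np

# Illustrative right-skewed prices: most are moderate, a few are very expensive.
price = np.array([180_000, 221_900, 310_000, 340_000, 360_000,
                  410_000, 450_000, 538_000, 1_200_000, 2_000_000], dtype=float)

# A mean well above the median is a quick sign of right skew.
print(np.mean(price) > np.median(price))  # True
```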

Price's correlation with continuous variables:

Next, I examined the relationship between the output variable (house price) and the continuous variables in the dataset, using correlation coefficients to investigate potential relationships between the variables.

Data exploration diagram:


Price and square footage of living space (sqft_living) clearly have a linear relationship (r = 0.7), which means there is a strong positive correlation and that it is a good predictor of house price. Similarly, I computed the correlation coefficients for each of the continuous variables.
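Outside SAS Enterprise Miner, the same Pearson correlation coefficient can be sketched in Python; the five price/area pairs below are invented purely to show the calculation:

```python
import numpy as np

# Toy price / living-area pairs (made up for illustration only).
sqft_living = np.array([770, 1180, 1680, 2570, 3560], dtype=float)
price = np.array([180_000, 221_900, 510_000, 538_000, 860_000], dtype=float)

# Pearson correlation coefficient between living area and price.
r = np.corrcoef(sqft_living, price)[0, 1]
print(round(r, 2))
```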

The analysis revealed a moderate correlation between basement size (if a basement is present) and house price, as well as a small correlation with the year of renovation (if the house was renovated).

Basement and renovation could be more interesting for my analysis if they were categorized as dichotomous variables (e.g., 0 for no basement, 1 for a basement).
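The recoding described above can be sketched in Python as follows; the column names sqft_basement and yr_renovated (where 0 means no basement / never renovated) are assumptions about the dataset schema:

```python
import pandas as pd

# Assumed convention: 0 in sqft_basement / yr_renovated means absent.
df = pd.DataFrame({"sqft_basement": [0, 400, 0, 910],
                   "yr_renovated": [0, 0, 1991, 0]})

# Recode as 0/1 flags so downstream models can use them directly.
df["has_basement"] = (df["sqft_basement"] > 0).astype(int)
df["was_renovated"] = (df["yr_renovated"] > 0).astype(int)
print(df["has_basement"].tolist())   # [0, 1, 0, 1]
print(df["was_renovated"].tolist())  # [0, 0, 1, 0]
```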

Decision Tree Model

Based on the analysis, it was clear that the size of the basement (if one is present) and the year of the renovation (if one was done) have a small impact on the house price.
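The core operation of a regression tree, as built by SAS Enterprise Miner, is choosing the two-way split that most reduces squared error. A minimal Python sketch of that single-split search, on invented data:

```python
import numpy as np

def best_split(x, y):
    # Find the threshold on x that minimizes total squared error
    # of a two-way split (the building block of a regression tree).
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

# Toy data where prices jump sharply above 1500 sqft.
sqft = np.array([900, 1100, 1300, 1500, 2500, 2700, 2900, 3100])
price = np.array([200, 210, 220, 230, 600, 620, 640, 660], dtype=float) * 1000
print(best_split(sqft, price))  # 1500
```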


Using scatterplots, I discovered that sqft_above and sqft_living15 have a strong connection to price. I therefore looked at their associations with sqft_living and found, as expected, a strong positive relationship (r > 0.69) among the three variables. This was obvious for sqft_above, because it equals sqft_living minus sqft_basement and both affect price.

Because sqft_living15 (the average living area of the 15 nearest houses) is highly correlated with sqft_living, I was unsure whether its relationship with house price added any independent information. Using the variance inflation factor (VIF), I assessed the collinearity of sqft_living15 and decided to include it alongside sqft_living in the final prediction model.
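For two predictors, the VIF reduces to 1/(1 - r²), where r is their correlation. A Python sketch on synthetic, deliberately collinear data standing in for sqft_living and sqft_living15:

```python
import numpy as np

# Synthetic, highly collinear pair standing in for sqft_living / sqft_living15.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(800, 4000, 200)
sqft_living15 = sqft_living * 0.9 + rng.normal(0, 150, 200)

# With exactly two predictors, VIF_j = 1 / (1 - r^2).
r = np.corrcoef(sqft_living, sqft_living15)[0, 1]
vif = 1.0 / (1.0 - r**2)
print(vif > 5)  # a common rule-of-thumb threshold for problematic collinearity
```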

Using variable worth, I observed the relationship between the output variable and the categorical variables. I found that bedrooms, bathrooms, floors, view, and grade all have moderate to strong correlations with price.

I also tried including interaction variables and fitting higher-order polynomials to the input. For instance, I attempted to fit a quadratic function on sqft_living and other features.

Bedrooms*bedrooms: this gives houses with additional bedrooms 'more weight', so this feature will mainly affect homes with many bedrooms.

Bedrooms*bathrooms: this value is large only when both counts are large, so houses with many bedrooms and bathrooms will likewise 'get more weight'.
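The two engineered features above can be sketched in Python (the feature names bedrooms_squared and bed_bath_rooms are my labels for the transformed columns):

```python
import pandas as pd

# Small illustrative frame; values are invented.
df = pd.DataFrame({"bedrooms": [2, 3, 5], "bathrooms": [1.0, 2.25, 3.5]})

# Quadratic term: up-weights houses with many bedrooms.
df["bedrooms_squared"] = df["bedrooms"] ** 2
# Interaction term: large only when both counts are large.
df["bed_bath_rooms"] = df["bedrooms"] * df["bathrooms"]
print(df["bedrooms_squared"].tolist())  # [4, 9, 25]
print(df["bed_bath_rooms"].tolist())    # [2.0, 6.75, 17.5]
```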

Stat explorer on the transformed variables:

The correlation between price and the newly created, transformed, and replaced variables has been provided by the stat explorer. I tried a variety of feature combinations to find the best fit for our model based on the results.

Data Partition (Splitting the data):

I divided the data into training (60%), validation (20%), and test (20%) sets.
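The partition was done with the Data Partition node in SAS Enterprise Miner; an equivalent 60/20/20 split can be sketched in Python as:

```python
import numpy as np

# Shuffle row indices, then cut at 60% / 80% to get train / validation / test.
rng = np.random.default_rng(42)
n = 100
idx = rng.permutation(n)
train, valid, test = np.split(idx, [int(0.6 * n), int(0.8 * n)])
print(len(train), len(valid), len(test))  # 60 20 20
```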

Continuous variables:

  • sqft_living, sqft_above, and sqft_basement are moderately to strongly correlated with price.
  • These three variables are strongly correlated with one another, since sqft_living = sqft_above + sqft_basement.
  • sqft_lot, sqft_lot15, and yr_built are not strongly correlated with price.

Dichotomous variables:

  • waterfront, renovated (yes/no), and basement present (yes/no) show a slight correlation with price.

Qualitative variables:

  • There is a moderate to strong correlation between price and bedrooms, bathrooms, floors, views, and grade.


Regression Model:

Simple Linear Regression:

First, I attempted to predict house prices using simple linear regression with sqft_living as input and calculated the Root Mean Square Error (RMSE) on the test data. I then repeated the same test for each feature in the dataset and compared the RMSEs to determine the most accurate single-feature price estimator. The test error was smallest for sqft_living (RMSE: 268279.64), indicating it gives the best single-feature house price estimate for the investigated dataset.
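A minimal Python sketch of this fit-then-score loop for a single feature, on synthetic stand-in data (the real run used the SAS Enterprise Miner Regression node on the actual dataset):

```python
import numpy as np

# Synthetic stand-in data: price is roughly linear in living area plus noise.
rng = np.random.default_rng(1)
sqft_living = rng.uniform(800, 4000, 300)
price = 280.0 * sqft_living + 20_000 + rng.normal(0, 60_000, 300)

# Fit price ~ sqft_living by least squares on a training slice,
# then score RMSE on the held-out test slice.
train, test = slice(0, 240), slice(240, 300)
slope, intercept = np.polyfit(sqft_living[train], price[train], 1)
pred = slope * sqft_living[test] + intercept
rmse = np.sqrt(np.mean((price[test] - pred) ** 2))
print(rmse < 100_000)  # True: close to the simulated noise level
```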

Multiple Regression:

Next, I tried using multiple features to predict price. Based on the simple linear regression results, I used the best single estimator, sqft_living, in conjunction with the remaining features.

I tested each of the remaining features one at a time in conjunction with sqft_living (for instance, sqft_living plus bedrooms_squared, and so on), and used the training error to select the best combination. Finally, I chose the model complexity (number of features) using the validation error and the test error.
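This one-at-a-time search can be sketched in Python as a small forward-selection step; the data is synthetic, with one informative feature (bedrooms) and one pure-noise feature, so the validation RMSE should pick the informative one:

```python
import numpy as np

def rmse_of(X_cols, y, fit_idx, eval_idx):
    # Least-squares fit on fit_idx rows, RMSE scored on eval_idx rows.
    X = np.column_stack(X_cols + [np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X[fit_idx], y[fit_idx], rcond=None)
    pred = X[eval_idx] @ beta
    return np.sqrt(np.mean((y[eval_idx] - pred) ** 2))

# Synthetic example: sqft_living and bedrooms matter, "noise" does not.
rng = np.random.default_rng(7)
n = 300
sqft_living = rng.uniform(800, 4000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
noise_col = rng.normal(0, 1, n)
y = 250.0 * sqft_living + 40_000.0 * bedrooms + rng.normal(0, 30_000, n)

train = np.arange(0, 180)
valid = np.arange(180, 240)

features = {"sqft_living": sqft_living, "bedrooms": bedrooms, "noise": noise_col}
base = ["sqft_living"]
# Try each remaining feature alongside sqft_living; keep the best by validation RMSE.
scores = {name: rmse_of([features[f] for f in base] + [col], y, train, valid)
          for name, col in features.items() if name not in base}
best = min(scores, key=scores.get)
print(best)  # bedrooms
```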

When all of the features were chosen, the regression result was:

To find the best combination of features for my model, I tried out the following options:

The regression result obtained after the aforementioned features were selected:

Model comparison result:


In conclusion, we were able to forecast house prices by employing a combination of decision tree analysis and regression-based analysis. We started by looking at the data, cleaning it up, dealing with missing values, and finding variables that might be relevant. The data were then analyzed using decision tree models to identify the most crucial variables for predicting house prices. On our training and validation sets, we discovered that the two-way split decision tree model performed the best.

Then, we used regression analysis to build a predictive model from the key variables identified by the decision tree models. Based on its validation error rate, we determined that the best model contained a subset of the most significant variables.

In general, our findings demonstrate that a powerful strategy for predicting house prices can be a combination of decision tree and regression-based analysis. We were able to identify the most significant variables using the decision tree models, and we were able to use those variables in the regression analysis to create a more precise predictive model. Real estate companies and others who are interested in predicting house prices and comprehending the main factors that drive those prices might benefit from these findings.

