{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Case 01 - Real Estate\n", "\n", "Consider you want to analyse the real estate market. It was not possible to obtain data related to each appartment. So it was used the information from national The information you have available is the average information in each municipality. \n", "\n", "'https://github.com/masterfloss/data/blob/main/realEstate1.xlsx?raw=true'\n", "\n", "Create a model that explains the price. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "\n", "import pandas as pd\n", "import statsmodels.api as sm\n", "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import linear_model\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.neural_network import MLPRegressor\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "import sklearn.metrics as metrics" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 object\n", "price2018 float64\n", "price2000 int64\n", "purchacingPower2017 float64\n", "crime2019 float64\n", "crime1993 float64\n", "wage2018 float64\n", "waste2018 float64\n", "wasteSel2018 float64\n", "IMT2018percapita float64\n", "IMI2018percapita float64\n", "tourism2018 object\n", "wage2018.1 object\n", "grad float64\n", "tourism float64\n", "wage float64\n", "waste float64\n", "tax float64\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Preprocessing\n", "\n", "df=pd.read_excel('https://github.com/masterfloss/data/blob/main/realEstate1.xlsx?raw=true')\n", "df1=df\n", "df1[\"waste2018\"]=pd.to_numeric(df1.waste2018, errors='coerce')\n", "df1[\"tourism\"]=pd.to_numeric(df1.tourism2018, errors='coerce')\n", "df1[\"wasteSel2018\"]=pd.to_numeric(df1.wasteSel2018, errors='coerce')\n", "df1[\"wage\"]=pd.to_numeric(df1.wage2018, errors='coerce')\n", "df1['waste']=df1[\"wasteSel2018\"]/df1[\"waste2018\"]\n", "df1['tax']=df1['IMT2018percapita']+df1['IMI2018percapita'] \n", "df1=df.dropna()\n", "df1.dtypes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\profc\\AppData\\Roaming\\Python\\Python38\\site-packages\\statsmodels\\tsa\\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only\n", " x = pd.concat(x[::order], 1)\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: price2018 R-squared: 0.771
Model: OLS Adj. R-squared: 0.767
Method: Least Squares F-statistic: 215.8
Date: Sun, 21 Nov 2021 Prob (F-statistic): 2.26e-100
Time: 16:13:58 Log-Likelihood: -3774.3
No. Observations: 327 AIC: 7561.
Df Residuals: 321 BIC: 7583.
Df Model: 5
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const -7.593e+04 7109.203 -10.681 0.000 -8.99e+04 -6.19e+04
purchacingPower2017 1241.9925 93.044 13.348 0.000 1058.940 1425.045
crime2019 -610.6191 195.320 -3.126 0.002 -994.889 -226.350
waste 7.775e+04 1.97e+04 3.947 0.000 3.9e+04 1.17e+05
tourism -3.7013 1.177 -3.145 0.002 -6.017 -1.386
tax 227.1894 14.862 15.287 0.000 197.951 256.428
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 60.560 Durbin-Watson: 1.240
Prob(Omnibus): 0.000 Jarque-Bera (JB): 142.319
Skew: 0.908 Prob(JB): 1.25e-31
Kurtosis: 5.674 Cond. No. 2.44e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.44e+04. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: price2018 R-squared: 0.771\n", "Model: OLS Adj. R-squared: 0.767\n", "Method: Least Squares F-statistic: 215.8\n", "Date: Sun, 21 Nov 2021 Prob (F-statistic): 2.26e-100\n", "Time: 16:13:58 Log-Likelihood: -3774.3\n", "No. Observations: 327 AIC: 7561.\n", "Df Residuals: 321 BIC: 7583.\n", "Df Model: 5 \n", "Covariance Type: nonrobust \n", "=======================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "---------------------------------------------------------------------------------------\n", "const -7.593e+04 7109.203 -10.681 0.000 -8.99e+04 -6.19e+04\n", "purchacingPower2017 1241.9925 93.044 13.348 0.000 1058.940 1425.045\n", "crime2019 -610.6191 195.320 -3.126 0.002 -994.889 -226.350\n", "waste 7.775e+04 1.97e+04 3.947 0.000 3.9e+04 1.17e+05\n", "tourism -3.7013 1.177 -3.145 0.002 -6.017 -1.386\n", "tax 227.1894 14.862 15.287 0.000 197.951 256.428\n", "==============================================================================\n", "Omnibus: 60.560 Durbin-Watson: 1.240\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 142.319\n", "Skew: 0.908 Prob(JB): 1.25e-31\n", "Kurtosis: 5.674 Cond. No. 2.44e+04\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 2.44e+04. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating a regression model\n", "y=df1['price2018']\n", "X= df1[['purchacingPower2017','crime2019','waste','tourism','tax']]\n", "\n", "X = sm.add_constant(X)\n", "results = sm.OLS(y, X).fit()\n", "results.summary()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " feature VIF\n", "0 const 26.125925\n", "1 purchacingPower2017 1.475335\n", "2 crime2019 1.887041\n", "3 waste 1.332964\n", "4 tourism 1.792031\n", "5 tax 2.547069\n" ] } ], "source": [ "# VIF (Variance inflation factor) is a measure of the amount of multicollinearity \n", "# in a set of multiple regression variables.\n", "\n", "vif_data = pd.DataFrame()\n", "vif_data[\"feature\"] = X.columns\n", " \n", "# calculating VIF for each feature\n", "vif_data[\"VIF\"] = [variance_inflation_factor(X.values, i)\n", " for i in range(len(X.columns))]\n", " \n", "print(vif_data)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Compare several algorithms in order to identify the one that predicts better.\n", "# Split train and test\n", "\n", "# Linear Regression\n", "# Ridge Regression\n", "# Lasso Regression\n", "# Bayesian Regression\n", "# Polynomial Regression\n", "# Neural Network\n", "# Random Forest\n", "# Gradient Boosting \n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Split train and test\n", "\n", "y=df1['price2018']-df1['price2000']\n", "X= df1[['purchacingPower2017','crime2019','IMT2018percapita','IMI2018percapita','waste']]\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "\n", "y_test_predict={}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.763\n", "Accuracy on the test subset: 0.642\n" ] } ], "source": [ "# Linear Regression\n", "\n", "reg = linear_model.LinearRegression()\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "LR_ATrain=reg.score(X_train, y_train)\n", "LR_ATest=reg.score(X_test, y_test)\n", "y_test_predict['linear']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.762\n", "Accuracy on the test subset: 0.644\n" ] } ], "source": [ "# Ridge Regression\n", "\n", "reg = linear_model.Ridge (alpha = .5)\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "y_test_predict['Ridge']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.763\n", "Accuracy on the test subset: 0.642\n" ] } ], "source": [ "#Lasso Regression\n", "\n", "reg = linear_model.Lasso(alpha = .5)\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "y_test_predict['Lasso']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.757\n", "Accuracy on the test subset: 0.642\n" ] } ], "source": [ "#Bayesian Regression\n", "\n", "reg = linear_model.BayesianRidge(compute_score=True)\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "y_test_predict['Bayesian']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.818\n", "Accuracy on the test subset: 0.426\n" ] } ], "source": [ "#Polynomial Regression\n", "\n", "reg = Pipeline([('poly', PolynomialFeatures(degree=2)),('linear', LinearRegression(fit_intercept=False))])\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "y_test_predict['Poli']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.755\n", "Accuracy on the test subset: 0.619\n" ] } ], "source": [ "#Neural Network\n", "\n", "reg = MLPRegressor(random_state=1,hidden_layer_sizes = (9,7), activation='relu', max_iter=5000, solver='lbfgs')\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "y_test_predict['NN']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.838\n", "Accuracy on the test subset: 0.644\n" ] } ], "source": [ "#Random Forest\n", "\n", "from sklearn.ensemble import RandomForestRegressor\n", "reg = RandomForestRegressor(n_estimators=69, max_depth=3, random_state=0)\n", "reg.fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "\n", "y_test_predict['Rforest']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on the training subset: 0.838\n", "Accuracy on the test subset: 0.644\n" ] } ], "source": [ "# Gradient Boosting \n", "\n", "clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)\n", "print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))\n", "print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))\n", "\n", "\n", "y_test_predict['GradientBoosting']=reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MeasuresMSEMedSEMAEEVR2
linear10320.86212656647.857763.380.640.64
Ridge10174.21211434634.527963.670.650.64
Lasso10319.36212635240.687755.350.640.64
Bayesian9917.78212438960.287803.110.640.64
Poli11331.00340287984.587179.030.440.43
NN10322.48225913545.828426.010.630.62
Rforest10597.78211291876.168017.240.660.64
GradientBoosting10597.78211291876.168017.240.660.64
\n", "
" ], "text/plain": [ "Measures MSE MedSE MAE EV R2\n", "linear 10320.86 212656647.85 7763.38 0.64 0.64\n", "Ridge 10174.21 211434634.52 7963.67 0.65 0.64\n", "Lasso 10319.36 212635240.68 7755.35 0.64 0.64\n", "Bayesian 9917.78 212438960.28 7803.11 0.64 0.64\n", "Poli 11331.00 340287984.58 7179.03 0.44 0.43\n", "NN 10322.48 225913545.82 8426.01 0.63 0.62\n", "Rforest 10597.78 211291876.16 8017.24 0.66 0.64\n", "GradientBoosting 10597.78 211291876.16 8017.24 0.66 0.64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use metrics to compare\n", "\n", "DicMeasures={}\n", "DicMeasures={'Measures':['MSE','MedSE','MAE','EV','R2']}\n", "for a in y_test_predict.keys():\n", " # Mean absolute error \n", " DicMeasures[a]= []\n", " DicMeasures[a].append(round(metrics.mean_absolute_error(y_test, y_test_predict[a]),2))\n", " # Mean squared error \n", " DicMeasures[a].append(round(metrics.mean_squared_error(y_test, y_test_predict[a]),2))\n", " # Median absolute error \n", " DicMeasures[a].append(round(metrics.median_absolute_error(y_test, y_test_predict[a]),2))\n", " # Explain variance score \n", " DicMeasures[a].append(round(metrics.explained_variance_score(y_test, y_test_predict[a]),2)) \n", " # R2 score \n", " DicMeasures[a].append(round(metrics.r2_score(y_test, y_test_predict[a]),2))\n", " \n", "df=pd.DataFrame(data=DicMeasures)\n", "pd.options.display.float_format = '{:.2f}'.format\n", "df.set_index('Measures').T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }