Everyone who has ever owned or lived in a house knows at least a little bit about the whims of the real estate market. Big houses cost more, neighborhood matters, proximity to basic services is great, age and style are important in some markets, you name it. But what is it that matters the most? This is a question that visualization can help us answer.
Let’s talk data
Here in the exciting (and, more importantly, exacting) world of data science, we love asking the hard questions. Does size really matter in real estate? How does it fare compared to other features of the house? How important is it?
In this episode of Super Awesome Templates (SATs), we ask that you put yourself in the shoes of a young travelling software developer, disillusioned by the excesses of the big city and seeking respite in the serenity of good ol’ Iowa. Leaving his fate to the gods of probability, he ran a random number generator on his laptop and wound up with Ames, Iowa as his destination. Very interesting choice, as we will soon see. Anticipating a new way of life, our protagonist downloads a dataset detailing real estate information, including lot size, sale price and 31 other features, for 2930 houses in Ames.
The question is: Which factor affects house sale price most significantly? How can he identify it with the help of Spotfire visualization and modeling tools? Let’s leverage the power of data science to expose secret trends that are hidden from common intuition. Welcome to “Real Estate Secrets: Spotting Hidden Trends Like A Pro.”
For this tutorial, extract the real-estate-secrets template and have a look at it if you get tripped up anywhere. Amazing technical detail and cool visualizations await.
Quick Kick-off: Grabbing the Data
Fire up Spotfire and select the following sequence: File > Add Data Tables > Add > File… > navigate to the folder where you downloaded and extracted the real-estate-secrets ZIP file and select “Ames Housing Full.csv” > OK
View the data in Excel fashion by: Insert > Visualization > Table. A data table called “33 columns from Ames Housing Full” should appear on the page.
A Once-Over: Scanning the Data
Each house in the dataset is described by 33 values known as features. Since we have money on our minds here, the feature “SalePrice” is the name of the game. It is our target feature (or response feature, depending on who you are talking to), the single important feature that we want to build statistical knowledge around. We want to learn something about the relationship between other variables and the sale price of the house.
Let’s take a glance at the other features. Some, such as “PID” and “Column 1”, are simply unique identifiers of the houses. They don’t tell us much about the house itself. Others are numerical measurements, such as House Age, Garage.Area and Lot.Area. They are self-explanatory. SalePrice also falls in this category. It is the market price of the house in 2008, a dollar value stored as an integer.
Overall.Cond and Overall.Qual are also integers, but they are categorical variables in which each numerical value represents a qualitative grade of house condition or house quality, respectively. A house with an Overall.Cond of 9 is great, while one with an Overall.Cond of 1 is terrible. Their integer values represent a human-assigned quality rating rather than a physical measurement.
Exter.Cond and Exter.Qual are also categorical variables, but they are measured on a scale of abbreviations. In Exter.Cond, “Po” is poor and “Ex” is excellent. This is an example of a categorical variable stored as text rather than as a number. Another kind of categorical variable simply marks the house with a label that has no natural order. The feature “Neighborhood” is a great example of this. We have “NAmes” for North Ames, “CollgCr” for College Creek, etc.
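If you ever want to treat a graded scale like Exter.Cond as numbers, one common trick is to map each abbreviation to its rank. Here is a minimal Python sketch of that idea. Only “Po” and “Ex” are named in this article. The intermediate codes “Fa”, “TA” and “Gd” are an assumption based on the usual Ames data dictionary, so check them against your own copy of the data.

```python
# Ordered from worst to best. "Po" and "Ex" come from the article;
# "Fa", "TA" and "Gd" are assumed intermediate grades (verify in your data).
QUALITY_ORDER = ["Po", "Fa", "TA", "Gd", "Ex"]
QUALITY_RANK = {code: rank for rank, code in enumerate(QUALITY_ORDER, start=1)}

def encode_quality(codes):
    """Turn a list of quality abbreviations into integer ranks (1 = worst)."""
    return [QUALITY_RANK[code] for code in codes]

# Made-up sample values, for illustration only.
sample = ["Po", "Ex", "TA"]
print(encode_quality(sample))
```

A label such as Neighborhood has no such order, so ranking it this way would not be meaningful.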
That’s a lot of variables we have to play with! Let’s dive right in.
Affirming (or, better yet, rejecting) What Your Neighbors Say: Data Visualization
First things first: let’s see how our sale price is distributed across houses around Ames. Make a bar chart: Insert > Visualization > Bar Chart. Don’t worry about the plot that appears immediately: it makes no sense until you give it some quality control. Change up the x-axis: click on the dropdown button below the horizontal axis (which probably says “MS.Zoning”) > SalePrice. Right-click “SalePrice”, then choose “Auto-bin Column”. Click and hold the slider that appears above “SalePrice (binned)”, then drag it to the middle of the bar. Now we see that the sale price looks like a little rolling hill with a peak near the 130000 mark. Navigate to the page “A Quick Overview” in real-estate-secrets to see if we’re getting the same results and talking about the same thing.
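Curious what binning a column actually does? The sketch below shows the general idea in plain Python, assuming equal-width bins. Spotfire’s own auto-binning may choose its bins differently, and the prices here are made up for illustration rather than taken from the Ames data.

```python
from collections import Counter

def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # The maximum value belongs to the last bin, not a bin of its own.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    # One (bin_start, bin_end, count) tuple per bin, including empty bins.
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_bins)]

# Made-up sale prices, for illustration only.
prices = [100000, 120000, 125000, 130000, 135000, 140000, 180000, 300000]
histogram = bin_counts(prices, 4)
for start, end, count in histogram:
    print(f"{start:>9.0f} to {end:>9.0f}: {'#' * count}")
```

Dragging the slider in Spotfire corresponds to changing the number of bins: fewer bins give a smoother but coarser hill, more bins give a spikier one.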
Some might say that lot area, total living area, overall condition (inside the house) and overall quality (also inside the house) are probably important factors in determining the price. Let’s see if our data agrees.
Make a scatter plot: Insert > Visualization > Scatter Plot. Don’t worry about the strange series of disjoint lines you see in the default plot. We will switch it up in just a second. Switch up the y-axis: select the dropdown button to the left of the vertical axis (which probably says “PID”) > select SalePrice. Next, switch up the x-axis: select the dropdown button below the horizontal axis (which probably says “Column 1”) > select Lot.Area. Check out the cluster of points we just created. Your thoughts?
So then, the question is: does a bigger lot area imply a higher sale price? Yes and no. Notice how the scatter plot is very lumpy? This implies a weak correlation with a lot of variation. A house with a Lot.Area of 10000 seems to cost anywhere between 50000 and 500000. Of course, as we move to a larger Lot.Area, the floor and ceiling of this bracket both increase. But as the bracket shifts, the points within it also become more sparse. This is typical of a flaky and noncommittal, but still somewhat linear, relationship between Lot.Area and SalePrice.
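To put a number on “lumpy”, you can compute the Pearson correlation coefficient, which runs from -1 to 1 and sits near 0 when the points are widely scattered. The following is a small sketch in plain Python. The function name and both toy datasets are made up for illustration and are not drawn from the Ames data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up "lumpy" data: similar lot areas, wildly different prices.
lumpy_area = [5000, 8000, 10000, 10000, 12000, 20000]
lumpy_price = [90000, 200000, 60000, 400000, 150000, 250000]

# Made-up "tight" data: points hugging a straight line.
tight_x = [1, 2, 3, 4, 5]
tight_y = [2.1, 3.9, 6.2, 7.8, 10.1]

print("lumpy:", round(pearson_r(lumpy_area, lumpy_price), 3))
print("tight:", round(pearson_r(tight_x, tight_y), 3))
```

The lumpy set scores much lower than the tight one, which matches what your eyes tell you from the two kinds of scatter plot.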
Exploring Curious Spaces: More Visualizations
Let’s look at Grade Living Area next. Repeat the same steps as above, but use Gr.Liv.Area for the x-axis instead of Lot.Area.
Grade Living Area (more precisely, above-grade living area) is a metric used in the real estate industry to quantify the livable floor area of a home above ground level. Intuitively, and judging by the plot, this seems like a far better feature in terms of its correlation with SalePrice. The scatter plot forms a steady stream, almost converging into a straight line. A tight cluster around a straight line indicates a strong linear relationship.
Make similar visualizations for Overall.Cond and Overall.Qual. Simply repeat the same steps, but set the bottom dropdown variable to “Overall.Cond” or “Overall.Qual” instead of “Lot.Area” or “Gr.Liv.Area”. Is the linear relationship between these variables and SalePrice strong?
Want to learn more? Navigate to the page “How Good are our Assumptions?” in real-estate-secrets for some juicy technical detail.
In our dataset, we have a grand total of 33 columns. You can plot anything at all against SalePrice, and get some pretty interesting results. Exter.Qual and Exter.Cond are two good ones to try. They indicate the user rating of external appearance and external upkeep respectively. Year.Built, House.Style and Neighborhood also give some cool results. Remember to vary between bar charts, scatter plots and combination plots where appropriate. You don’t want to make misleading plots that can mess up your interpretation! For examples, navigate to the pages “What Determines Sale Price?”, “What Really Determines Sale Price?” and “What Else Really Determines Sale Price?” in real-estate-secrets. (Eloquence is one of my strong suits.)
Let Numbers Do the Talking: Modeling the Data
Having built some intuition about some of the variables that affect SalePrice, let’s hash out useful numbers that help us distinguish between a strong linear relationship and a weak one easily and effectively. To do this, we can add a linear regression model to the same plots we made previously. Navigate to any plot, and do: right-click within the boundary of the plot > Properties > Lines & Curves > Add > Straight Line Fit… A straight black line approximating (or, as we like to say, modelling) the trend in the plot appears.
The idea behind a linear regression is to draw the best line through the plot, the one that makes the smallest overall error (measured as the sum of squared errors) in predicting each SalePrice on the y-axis from, say, Gr.Liv.Area on the x-axis. Since the regression line always minimizes this error, a bad-looking line simply means that, as hard as your computer tries, it cannot find a good linear model for the feature you have specified. In other words, the relationship between your feature and SalePrice is not strongly linear, or may not exist at all.
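For the curious, here is a bare-bones sketch of how a “best line” can be found with ordinary least squares, which minimizes the sum of squared prediction errors. We are assuming this is the flavor of fit behind a straight line fit. The function names and the four data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Total squared gap between actual ys and the line's predictions."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up living areas and prices, for illustration only.
living_area = [1000, 1500, 2000, 2500]
sale_price = [110000, 160000, 190000, 260000]

slope, intercept = fit_line(living_area, sale_price)
best_error = sum_squared_error(living_area, sale_price, slope, intercept)

# Nudging the fitted line in any direction can only make the error worse.
nudged_error = sum_squared_error(living_area, sale_price, slope + 5, intercept)
print(slope, intercept, best_error < nudged_error)
```

That last comparison is the whole point: no other straight line beats the fitted one on squared error, so if the fitted line still looks bad, the feature simply does not line up with SalePrice.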
A good (bad) example of this is fitting a linear regression between SalePrice and PID. There is no real relationship between the ID number of a house and its sale price, so the regression line and the scatter plot are completely at odds. In contrast, a regression line built between SalePrice and Gr.Liv.Area should look quite decent. Navigate to “Let’s Make Models” in real-estate-secrets to see what this looks like.
Mouse over your regression line and you should see a black tooltip showing several numbers. Today, we will focus on the “R^2” of the regression, which measures how well the straight line fits the data. This metric ranges from 0 to 1, with 0 indicating no linear relationship and 1 indicating a perfectly linear relationship.
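If you would like to see where such a number comes from, the sketch below computes R^2 by hand using the standard definition: 1 minus the ratio of the line’s squared error to the squared spread of the data around its mean. This is an illustration in plain Python with made-up toy data, not a description of what Spotfire does internally.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares straight line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Squared error left over after using the line to predict y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared spread of y around its own mean (error of a "no model" guess).
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Toy data: points exactly on a line, and points with no linear trend.
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])
unrelated = r_squared([1, 2, 3, 4], [5, 1, 1, 5])
print(perfect, unrelated)
```

The first toy set lies exactly on a line and lands at the top of the scale, while the second has no linear trend at all and lands at the bottom. Note that the function divides by the spread of y, so it would fail on a column where every value is identical.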
For me, the regression line between SalePrice and Gr.Liv.Area has an R^2 of 0.600, while the one between SalePrice and Lot.Area has only 0.079. This indicates that Gr.Liv.Area has a much, much stronger linear relationship with SalePrice than Lot.Area does. As Gr.Liv.Area increases, SalePrice increases roughly proportionally. However, there is a massive caveat in this interpretation. Can you identify it? (Hint: have a look at the page “Deceptive Features?” in real-estate-secrets. What is going on there?)
Doing the same for Overall.Qual yields an R^2 of 0.686. This indicates that Overall.Qual may be a better predictor of SalePrice than Gr.Liv.Area is. Does this seem intuitive or plausible? What is the caveat here? Plotting a regression line for SalePrice against Neighborhood seems to work too, but does it make any sense at all? Do note that each neighborhood has a rather distinct range of SalePrice. There are better ways to measure the predictive power of the neighborhood a house is located in, but we will not be elaborating on that today. Look out for more datashoptalk articles on this topic!
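One simple way to look at a label like Neighborhood without forcing a line through it is to summarize the prices within each group. The sketch below does that in plain Python. The neighborhood codes come from this article, but every price is made up for illustration, and this is only one of several reasonable ways to compare groups.

```python
from collections import defaultdict
from statistics import median

def price_range_by_group(rows):
    """Map each group label to its (min, median, max) price."""
    groups = defaultdict(list)
    for label, price in rows:
        groups[label].append(price)
    return {label: (min(prices), median(prices), max(prices))
            for label, prices in groups.items()}

# Made-up (neighborhood, sale price) pairs, for illustration only.
rows = [
    ("NAmes", 100000), ("NAmes", 130000), ("NAmes", 150000),
    ("CollgCr", 180000), ("CollgCr", 200000), ("CollgCr", 240000),
]
summary = price_range_by_group(rows)
for label, (low, mid, high) in summary.items():
    print(f"{label}: {low} to {high}, median {mid}")
```

If the ranges barely overlap, as in this toy example, the label clearly carries information about price even though it has no numeric order for a straight line to follow.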
Moving Ahead: Data, Data, and More Data
Congrats! You’re well on your way to discovering little-known secrets in the world of real estate through visualization and modelling. The same set of procedures can be applied to any other data set. Remember: start with your assumptions, explore wild options, make models and compare them to gain knowledge. For more in-depth details on the procedures in this tutorial, be sure to read up on the “Look here.” boxes set up on each page of real-estate-secrets. Knowledge is power 🙂
Until next time,
Your friendly neighborhood data scientists.