Introduction

When you purchase a brand new car, the manufacturer’s suggested retail price, the MSRP, is predetermined. But when you sell your car, and ownership of the vehicle changes hands, how do you determine what the car is worth? Most likely you go to Kelley Blue Book or CarFax for an appraisal. Many factors contribute to a car’s value, from mileage to engine type. In this project I will show you how to estimate a car’s worth using data gathered from scratch! The workflow looks like this:

  1. Scrape Kelley Blue Book’s used car listings with a Python (Scrapy) script
  2. Clean the data
  3. Engineer new features
  4. Explore the data (EDA)
  5. Design a predictive model

Part I: Collecting The Data

For any aspiring data scientist, web scraping is a valuable skill. Most of the time data scientists work with prepared data sets, supplied by employers, clients, or data set hosts like Kaggle. But for the ever-curious data scientist, collecting unstructured data can be a worthwhile venture. Scraping the web for data can be tricky, but Scrapy, a Python-based framework, makes the process much easier. Here I will show you how to scrape KBB.com for used car listings.

library(reticulate) #enables the Python chunks in this R Markdown post

How Scrapy Works

Scrapy allows you to program a bot (i.e. a Python script) that collects whatever data you want from a webpage. Although I won’t go into depth about all the code a Scrapy bot requires, I will outline the key points of Scrapy bot design.

  1. Initialize a Scrapy project directory. Starting a Scrapy project is easy: after installing Scrapy, go to your terminal (such as Anaconda Prompt) and type scrapy startproject <Name of Project>. This will create a Scrapy directory with all the files you need to start web scraping.
  2. Set up the items.py script. This script tells the bot which elements of a website you want to scrape. For this project I want the details of each used car listing.
import scrapy
class KbbItem(scrapy.Item):
    car = scrapy.Field()
    mileage = scrapy.Field()
    price = scrapy.Field()
    color = scrapy.Field()
    transmission = scrapy.Field()
    engine = scrapy.Field()
    doors = scrapy.Field()
    drive_type = scrapy.Field()
    body = scrapy.Field()
    mpg = scrapy.Field()
    kbb_expert_review = scrapy.Field()
    consumer_review = scrapy.Field()
    location = scrapy.Field()
    website = scrapy.Field()
  3. Set up the spider.py script. This is the brain of the bot: the spider “crawls” the webpage and collects the fields you defined in items.py. This is a hefty script! I will save a full explanation of this code for a future blog post.
from scrapy import Spider, Request
from kbb.items import KbbItem
import re
class kbbSpider(Spider):
    #name: an attribute specifying a unique name to identify the spider
    name = "kbb_spider"
    
    #allowed_domains: the main domain of the website you want to scrape
    allowed_domains = ['www.kbb.com']
    
    #start_urls: an attribute listing the URLs the spider will start from
    start_urls = ['https://www.kbb.com/cars-for-sale/cars/used-cars/?p=1&color=']
    
    #parse(): a method of the spider responsible for processing a Response object downloaded from the URL and returning scraped data (as well as more URLs to follow, if necessary)
    
    def parse(self, response):
        text = response.xpath('//*[@id="listingsContainer"]//span[@class="page-numbers"]/text()').extract()[1]
        colors = ['beige', 'black', 'blue', 'brown', 'burgundy', 'charcoal', 'gold',
                  'gray', 'green', 'offwhite', 'orange', 'pink', 'purple', 'red',
                  'silver', 'tan', 'turquoise', 'white', 'yellow']
        result_pages = ['https://www.kbb.com/cars-for-sale/cars/used-cars/?p={}&color={}'.format(x, y)
                        for x in range(1, int(text) + 1) for y in colors]
        for url in result_pages:
            yield Request(url=url, callback=self.parse_result_page)
    def parse_result_page(self, response):
        products = response.xpath('//*[@id="listingsContainer"]//a[@class="js-vehicle-name"]/@href').extract()
        product_urls = ['https://www.kbb.com' + x for x in products]
        for url in product_urls:
            yield Request(url=url, callback=self.parse_product_page)
    def parse_product_page(self, response):
        car = response.xpath('//*[@id="vehicleDetails"]/div/div/h1/text()').extract_first()
        price = response.xpath('//*[@id="pricingWrapper"]/span/span[@class ="price"]/text()').extract_first()
        mileage = response.xpath('//*[@id="keyDetailsContent"]/div/div/div/ul/li[@class="js-mileage"]/text()').extract_first()
        color = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Exterior Color:")]/text()').extract_first()
        transmission = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Transmission:")]/text()').extract_first()
        engine = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Engine:")]/text()').extract_first()
        doors = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Doors:")]/text()').extract_first()
        body = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Body Style:")]/text()').extract_first()
        mpg = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Fuel Economy:")]/text()').extract_first()
        drive_type = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Drive Type:")]/text()').extract_first()
        kbb_expert_review = response.xpath('//*[@id="readExpertReview"]/p/span[@class="title-three"]/text()').extract_first()
        consumer_review = response.xpath('//*[@id="readConsumerReview"]/p/span[@class="title-three"]/text()').extract_first()
        location = response.xpath('//*[@id="aboutTheDealer"]//p/text()').extract()[:2]
        website = response.xpath('//*[@id="dealerDetailsModuleWebsite"]/@href').extract_first()
        item = KbbItem()
        item['body'] = body
        item['mpg'] = mpg
        item['kbb_expert_review'] = kbb_expert_review
        item['consumer_review'] = consumer_review
        item['car'] = car
        item['price'] = price
        item['mileage'] = mileage
        item['color'] = color
        item['transmission'] = transmission
        item['engine'] = engine
        item['doors'] = doors
        item['drive_type'] = drive_type
        item['location'] = location
        item['website'] = website
        yield item
  4. Set up the pipelines.py script. This script tells the bot how to save the data it collects. CSV format? Text file? It’s up to you!
from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter
class ValidateItemPipeline(object):
    #drop any listing that is missing one or more fields
    def process_item(self, item, spider):
        if not all(item.values()):
            raise DropItem("Missing values!")
        else:
            return item
class KbbPipeline(object):
    def __init__(self):
        self.filename = 'kbb2.csv'
    def open_spider(self, spider):
        self.csvfile = open(self.filename, 'wb')
        self.exporter = CsvItemExporter(self.csvfile)
        self.exporter.start_exporting()
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.csvfile.close()
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
  5. Set up the settings.py script. This script controls the bot’s behavior, such as how fast it scrapes.
BOT_NAME = 'kbb'
SPIDER_MODULES = ['kbb.spiders']
NEWSPIDER_MODULE = 'kbb.spiders'
ROBOTSTXT_OBEY = True   #respect the site's robots.txt rules
DOWNLOAD_DELAY = 0.5    #pause between requests so we don't hammer the server
ITEM_PIPELINES = {
    'kbb.pipelines.KbbPipeline': 300,   #route scraped items through the CSV pipeline
}

That’s it! Now navigate to your command line and enter scrapy crawl <name of spider> (here, scrapy crawl kbb_spider). The spider will begin to crawl the site, collect the items you specified, and save them in the format you chose in the root of your project directory!

Part II: Cleaning the Data

When you finish scraping, chances are the data will not be ready for exploration. Here are some of the issues I had to address before I began my EDA (a code sketch follows the list).

  1. Removing spaces, tabs, symbols, etc. from my data
  2. Removing unnecessary strings from my data. For example: removing “Mileage:” from mileage data
  3. Converting columns of data into the correct data type. (e.g. changing characters to numeric type)
  4. The MPG feature has too much data per cell. “City/Highway/Combined/MPGe”. I chose to split this column into four columns
  5. Creating new columns for make and model, year built, and car condition
  6. Filling blank cells with NA
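Below is a minimal sketch of these cleaning steps in R with dplyr, stringr, and tidyr. The file name kbb_raw.csv, the exact label strings, and the slash-delimited mpg format are assumptions for illustration, not the exact code I ran:

library(dplyr)
library(stringr)
library(tidyr)

#hypothetical raw output of the spider
raw <- read.csv("kbb_raw.csv", stringsAsFactors = FALSE)

clean <- raw %>%
  #1-3. strip labels, symbols, and whitespace, then convert to numeric
  mutate(mileage = as.numeric(str_remove_all(str_remove(mileage, "Mileage:"), "[,\\s]")),
         price = as.numeric(str_remove_all(price, "[$,]"))) %>%
  #4. split "City/Highway/Combined/MPGe" into four columns
  separate(mpg, into = c("city_mpg", "highway_mpg",
                         "gas_combined_mpg", "electric_combined_mpg"),
           sep = "/", fill = "right", convert = TRUE) %>%
  #5. pull condition, year, and brand out of a title like "Used 2015 Kia Sedona SX"
  mutate(condition = word(car, 1),
         year = as.integer(str_extract(car, "\\b(19|20)\\d{2}\\b")),
         brand = word(car, 3))

#6. blank strings become NA
clean[clean == ""] <- NA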

Part III: Feature Engineering

Feature engineering involves creating new columns of data from existing features. For example: taking a column of U.S. states and making a new column called Region based on each state’s location, so that Connecticut would have the value “Northeast” and Oregon would be “Northwest”.

I did not employ feature engineering for this project, but I may in the future.
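I didn’t build a Region column here, but as a quick illustration of the state-to-region idea: base R ships with the state.name and state.region vectors, which map the 50 states to Census regions (Northeast, South, North Central, West). A finer-grained label like “Northwest” would need a custom lookup table.

#map full state names to U.S. Census regions using base R's built-in datasets
region_of <- setNames(as.character(state.region), state.name)
region_of["Connecticut"]  # "Northeast"
region_of["Oregon"]       # "West" under the Census scheme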

Part IV: Exploratory Data Analysis

A preview of the data set:

head(df) %>% kable() %>%
  kable_styling() %>% scroll_box()
| condition | year | brand | make_model | body | color | consumer_review | kbb_expert_review | mileage | city_mpg | highway_mpg | gas_combined_mpg | electric_combined_mpg | price | transmission | doors | engine | location | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Certified | 2015 | Kia | Kia Sedona SX | Van | Berry | 7.9 | 9.0 | 441 | 18 | 25 | 21 | NA | 28971 | 6-Speed Shiftable Automatic | Four Door | 6-Cylinder | 1100 East Golf road Schaumburg IL 60173 | https://www.schaumburgkia.com/?utm_source=KBB&utm_medium=website |
| Used | 2007 | Volkswagen | Volkswagen Beetle Convertible w/ Triple White | Convertible | Campanella White | 6.9 | NA | 47771 | 19 | 28 | 22 | NA | 9975 | 6-Speed Automatic | Two Door | 5-Cylinder | 4203 Franklin Common Ct Franklin TN 37067 | http://www.sunvertibles.com/?utm_source=KBB&utm_medium=website |
| Used | 2014 | Cadillac | Cadillac CTS AWD Sedan | Sedan | Majestic Plum Metallic | 9.2 | 8.0 | 32066 | 20 | 30 | 23 | NA | 19833 | 6-Speed Shiftable Automatic | Four Door | 4-Cylinder Turbo | 6100 Telegraph Road Toledo OH 43612 | http://www.groganstowne.com/?stocknum=K7912010&atc_ownerid=72868&utm_source=KBB&utm_medium=website |
| Used | 1993 | Chevrolet | Chevrolet Corvette Coupe | Coupe | Green | 9.0 | NA | 67488 | 15 | 22 | 18 | NA | 5928 | Automatic | Two Door | 8-Cylinder | 2236 Highway 2565 Louisa KY 41230 | http://www.bigbluemotorslouisa.com/?utm_source=KBB&utm_medium=website |
| Used | 2006 | Mazda | Mazda MAZDA5 Touring | Van | Grey | 7.9 | NA | 105194 | 19 | 24 | 21 | NA | 4999 | 4-Speed Automatic | Four Door | 4-Cylinder | 9815 Evergreen Way Everett WA 98204 | http://www.baysideautosales.net/?utm_source=KBB&utm_medium=website |
| Used | 2015 | FIAT | FIAT 500 e Hatchback | Hatchback | Electric Orange Tri-Coat Pearlcoat | 9.0 | 7.6 | 20579 | NA | NA | NA | 116 | 11895 | Single-Speed | Two Door | Electric | 18225 Ponderosa Drive Parker CO 80134 | http://www.rauschmotorsllc.com/?utm_source=KBB&utm_medium=website |

And now the fun part: seeing what insights the data holds. One of my favorite R packages for EDA is “XDA”. Let’s use its numSummary() function to get statistical information on our numerical features.

library(xda)

numSummary(df) %>%kable() %>%
  kable_styling() %>% scroll_box()
| | n | mean | sd | max | min | range | nunique | nzeros | iqr | lowerbound | upperbound | noutlier | kurtosis | skewness | mode | miss | miss% | 1% | 5% | 25% | 50% | 75% | 95% | 99% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | 17651 | 2011.63 | 4.99e+00 | 2018.0 | 1981.0 | 37.0 | 38 | 0 | 7.0 | 1997.5 | 2025.5 | 226 | 2.205 | -1.310 | 2015.0 | 0 | 0.000 | 1996.0 | 2002.0 | 2008.0 | 2013.0 | 2015.0 | 2017.0 | 2018.0 |
| consumer_review | 17351 | 8.48 | 6.20e-01 | 10.0 | 1.0 | 9.0 | 40 | 0 | 0.8 | 6.9 | 10.1 | 61 | 4.929 | -1.001 | 8.6 | 300 | 1.700 | 6.9 | 7.3 | 8.1 | 8.6 | 8.9 | 9.3 | 9.6 |
| kbb_expert_review | 10550 | 7.94 | 9.57e-01 | 9.8 | 3.3 | 6.5 | 54 | 0 | 1.4 | 5.2 | 10.8 | 90 | 0.908 | -0.464 | NA | 7101 | 40.230 | 5.3 | 6.4 | 7.3 | 7.9 | 8.7 | 9.4 | 9.7 |
| mileage | 17637 | 55586.01 | 4.37e+04 | 364000.0 | 1.0 | 363999.0 | 16277 | 0 | 59891.5 | -67176.2 | 172387.2 | 303 | 1.487 | 1.174 | NA | 14 | 0.079 | 724.4 | 5699.6 | 22661.0 | 41639.0 | 82550.0 | 139977.6 | 185112.8 |
| city_mpg | 16517 | 21.09 | 5.72e+00 | 58.0 | 8.0 | 50.0 | 45 | 0 | 8.0 | 5.0 | 37.0 | 224 | 4.258 | 1.347 | 17.0 | 1134 | 6.425 | 12.0 | 14.0 | 17.0 | 20.0 | 25.0 | 30.0 | 41.0 |
| highway_mpg | 16526 | 28.82 | 6.49e+00 | 53.0 | 13.0 | 40.0 | 38 | 0 | 10.0 | 9.0 | 49.0 | 6 | -0.696 | 0.159 | NA | 1125 | 6.374 | 17.0 | 19.0 | 24.0 | 28.0 | 34.0 | 39.0 | 42.0 |
| gas_combined_mpg | 16604 | 23.95 | 5.91e+00 | 56.0 | 10.0 | 46.0 | 41 | 0 | 9.0 | 5.5 | 41.5 | 157 | 1.171 | 0.762 | 20.0 | 1047 | 5.932 | 14.0 | 16.0 | 19.0 | 23.0 | 28.0 | 33.0 | 41.0 |
| electric_combined_mpg | 204 | 108.53 | 1.35e+01 | 133.0 | 46.0 | 87.0 | 25 | 0 | 11.0 | 88.5 | 132.5 | 19 | 6.916 | -2.416 | NA | 17447 | 98.844 | 53.0 | 88.0 | 105.0 | 114.0 | 116.0 | 119.0 | 119.0 |
| price | 17619 | 17100.66 | 1.50e+04 | 482000.0 | 500.0 | 481500.0 | 6578 | 0 | 10482.2 | -5728.4 | 36200.9 | 988 | 181.155 | 9.158 | 12995.0 | 32 | 0.181 | 2995.0 | 4995.0 | 9995.0 | 13426.0 | 20477.5 | 37761.3 | 64900.0 |

Column legend (taken from the XDA GitHub page):

n = total number of rows for that variable
nunique = number of unique values
nzeros = number of zeros
iqr = interquartile range
noutlier = number of outliers
miss = number of rows with missing values
miss% = percentage of total rows with missing values ((miss/n)*100)
5% = 5th percentile value of that variable (the value below which 5 percent of the observations fall); the percentile columns are helpful in detecting outliers

This function is very useful for gauging the number of outliers in a feature.

The less detailed charSummary() function can provide some insight into the character-type features.

charSummary(df) %>% kable() %>%
  kable_styling() %>% scroll_box()
| | n | miss | miss% | unique | top5levels:count |
|---|---|---|---|---|---|
| condition | 17651 | 0 | 0 | 2 | Used:15781, Certified:1870 |
| brand | 17651 | 0 | 0 | 55 | Ford:1846, Chevrolet:1838, Toyota:1515, Honda:1474, Nissan:1263 |
| make_model | 17651 | 0 | 0 | 3128 | Honda Civic LX Sedan:215, Ford Focus SE Sedan:154, Chevrolet Malibu LT:120, Ford Fusion SE:114, Lexus ES 350 :112 |
| body | 17651 | 0 | 0 | 8 | Sedan:6632, Sport Utility:5300, Hatchback:1560, Truck:1266, Coupe:1068 |
| color | 17651 | 0 | 0 | 2304 | Charcoal:829, Black:459, Beige:434, Purple:413, Gold:381 |
| transmission | 17651 | 0 | 0 | 37 | 6-Speed Shiftable Automatic:4799, Automatic:2489, Continuously Variable Automatic:2184, 4-Speed Automatic:1277, 5-Speed Automatic:1128 |
| doors | 17651 | 0 | 0 | 5 | Four Door:14650, Two Door:2845, Three Door:106, :40, Five Door:10 |
| engine | 17651 | 0 | 0 | 23 | 4-Cylinder:6528, 6-Cylinder:5342, 8-Cylinder:2116, 4-Cylinder Turbo:1889, 6-Cylinder Turbo:394 |
| location | 17651 | 0 | 0 | 4413 | 4319 Rt.130 South Edgewater Park NJ 08010:89, 333 Broadway Avenue Bedford OH 44146:88, Nationwide shipping to San Francisco CA 94109:85, :79, 21151 NW 2ND AVE MIAMI FL 33169:78 |
| website | 17651 | 0 | 0 | 10168 | :275, http://www.floridafinecars.com/?utm_source=KBB&utm_medium=website:111, https://www.carvana.com/?utm_source=autotrader&utm_medium=referral&utm_campaign=autotrader-referral-ATL&utm_content=vdp-visitwebsite:110, http://www.autodirectcars.com/?utm_source=KBB&utm_medium=website:89, http://www.ncautoauction.com/?utm_source=autotrader&utm_medium=referral&utm_content=visit-dealer-site&utm_campaign=big-orng-btn:88 |

Here we see that the Honda Civic LX Sedan is the most frequent make and model being sold. “Charcoal” is the most frequent car color, “6-Speed Shiftable Automatic” is the most frequent transmission, and so on.

Brand and Car Body Breakdown

Now that we have collected details on about 17,000 used cars from KBB.com, let’s see which brands dominate used car listings.

Of all the cars I collected (your results may vary), the most frequent brands were Ford, Chevy, Toyota and Honda.

Sedans and sport utility vehicles are by far the most frequent body styles in used car listings.
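The bar charts that accompanied this section aren’t reproduced here, but a minimal ggplot2 sketch of the brand count chart could look like this (assuming df is the cleaned data frame loaded above):

library(dplyr)
library(ggplot2)
library(forcats)

df %>%
  count(brand, sort = TRUE) %>%            #listings per brand
  slice_head(n = 10) %>%                   #keep the ten most common
  ggplot(aes(fct_reorder(brand, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "listings", title = "Most Frequent Brands in Used Car Listings")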

Customer Reviews vs KBB Expert Reviews

Let’s take a look at how ratings between car owners and Kelley Blue Book experts differ.

Here we can see that consumers are a little more lenient in how they rate a car and tend to give higher marks, while the experts are more critical.
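One way to see this (a sketch, not the original figure) is to overlay the two rating distributions:

library(tidyr)
library(ggplot2)

df %>%
  dplyr::select(consumer_review, kbb_expert_review) %>%
  pivot_longer(everything(), names_to = "rater", values_to = "rating") %>%
  ggplot(aes(rating, fill = rater)) +
  geom_density(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Consumer vs. KBB Expert Ratings")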

How Mileage Affects Price

Just as we would expect, as a car’s mileage increases, its price decreases. Convertibles, coupes, and luxury vehicles make up most of our outliers and skew the data. For that reason, I chose to exclude them and keep only the most frequent body types being sold: sedans and sport utility vehicles.
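A sketch of that filtered scatterplot, again assuming df from above:

library(dplyr)
library(ggplot2)

df %>%
  filter(body %in% c("Sedan", "Sport Utility")) %>%   #the two most common body types
  ggplot(aes(mileage, price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  labs(title = "Price vs. Mileage (Sedans and Sport Utility Vehicles)")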

Price, Mileage and Year Distributions

Most cars are priced between $11k and $13k. There is a substantial drop-off in cars valued over $15k. Some luxury vehicles selling for over $450,000 are outliers and need to be removed before predictive modeling.

Most cars have between 15,000 and 30,000 miles on the odometer.

Here we see something interesting: most cars being sold were built in 2015, making them only three years old. Why is this? According to the Edmunds Used Car Report, “24% of franchise used sales were 3 years old in Q1 2018”. Once a three-year lease is up, many owners look to sell their car, saturating the market with three-year-old cars.
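A one-line histogram reproduces this kind of model-year distribution (sketch):

library(ggplot2)

ggplot(df, aes(year)) +
  geom_histogram(binwidth = 1) +   #one bar per model year; expect a spike at 2015
  labs(title = "Model Year Distribution of Used Car Listings")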

Can we predict a car’s resale value?

In this next section we will train a multiple linear regression model with ridge regularization to predict the price of used cars from data the model has never seen.
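As a quick refresher (my notation, not from the original post): ridge regression minimizes the usual residual sum of squares plus an L2 penalty on the coefficients, where the tuning parameter lambda controls how strongly the coefficients are shrunk toward zero:

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2}$$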

library(gradDescent)
library(data.table)
library(dplyr)
library(ISLR)
library(glmnet)
library(MASS)
library(car)
df = fread('C:/Users/pache_000/Desktop/kbb/KBB_used_final_3.csv')

#80/20 Train-Test split
splitDF = splitData(df, 0.8, seed = 1)

train = splitDF$dataTrain
test = splitDF$dataTest

For simplicity’s sake, I will only use the quantitative features in this demonstration.

quantdf = select_if(train, is.numeric)
quantdftest = select_if(test, is.numeric)
quantdf$electric_combined_mpg = NULL
quantdftest$electric_combined_mpg = NULL
quantdf = na.omit(quantdf)
quantdftest = na.omit(quantdftest)
#initial model
model = lm(price ~ . , data = quantdf)

…Removing outliers

quantdf = quantdf[-c(2359, 6371, 3701, 2526, 2916),] 
quantdf = quantdf[-c(1933, 274, 6731),]
quantdf = quantdf[-c(3124, 3708, 2005, 5534),]

…applying a Box-Cox transformation to our model

model.transformed = boxcox(model, lambda = seq(-2, 2, 1/10), plotit = TRUE,
       interp = T, eps = 1/50, xlab = expression(lambda),
       ylab = "log-Likelihood")

#Extracting best lambda by finding the maximum log-likelihood.
lambda = model.transformed$x[which(model.transformed$y == max(model.transformed$y))]

#defining a boxcox function
bc <- function (obs, lambda) {(obs^lambda-1)/lambda }

#Use mapply() to apply the bc function to our dataset using our best lambda 
quant_df_bc = mapply(bc,quantdf,lambda)

#convert quant_df_bc to dataframe
quant_df_bc = data.frame(quant_df_bc)
#new multilinear model that uses our Box-Cox transformed dataframe
model.bc = lm(price ~ ., data = quant_df_bc)
plot(model.bc)

summary(model.bc)
## 
## Call:
## lm(formula = price ~ ., data = quant_df_bc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.021862 -0.002376 -0.000045  0.002434  0.015795 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.90e+02   3.45e+00  -55.09  < 2e-16 ***
## year               8.49e+01   1.52e+00   55.79  < 2e-16 ***
## consumer_review    2.76e-02   1.50e-03   18.48  < 2e-16 ***
## kbb_expert_review  2.25e-02   9.09e-04   24.72  < 2e-16 ***
## mileage           -1.68e-03   5.73e-04   -2.93   0.0034 ** 
## city_mpg           3.48e-02   6.32e-03    5.50  3.9e-08 ***
## highway_mpg       -4.10e-02   4.89e-03   -8.38  < 2e-16 ***
## gas_combined_mpg  -8.11e-02   1.03e-02   -7.90  3.2e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0039 on 8022 degrees of freedom
## Multiple R-squared:  0.655,  Adjusted R-squared:  0.655 
## F-statistic: 2.18e+03 on 7 and 8022 DF,  p-value: <2e-16
vif(model.bc)
##              year   consumer_review kbb_expert_review           mileage 
##              1.30              1.05              1.27              1.06 
##          city_mpg       highway_mpg  gas_combined_mpg 
##             81.77             29.93            173.34
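Note the very large VIFs on the three mpg features: city, highway, and combined mpg measure nearly the same thing, so the OLS coefficient estimates are unstable. That multicollinearity is exactly the situation ridge regularization is designed to handle, which motivates the next step.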
coef(model.bc)
##       (Intercept)              year   consumer_review kbb_expert_review 
##         -1.90e+02          8.49e+01          2.76e-02          2.25e-02 
##           mileage          city_mpg       highway_mpg  gas_combined_mpg 
##         -1.68e-03          3.48e-02         -4.10e-02         -8.11e-02
x = model.matrix(price ~ ., data = quant_df_bc)[, -1] #Dropping the intercept column.
y = quant_df_bc$price
#define grid for our ridge regression
grid = 10^seq(3, -5, length = 100)
#create our ridge model using x, y, and grid.
ridge.models = glmnet(x, y, alpha = 0, lambda = grid)

plot(ridge.models, xvar = "lambda", label = TRUE, main = "Ridge Regression")

set.seed(0)
traindf = sample(1:nrow(x), 8*nrow(x)/10)
testdf = (-traindf)
y.test = y[testdf]

#Double check 80/20 train/test split
#length(traindf)/nrow(x)
#length(y.test)/nrow(x)

…Running 10-fold cross validation.

set.seed(0)
cv.ridge.out = cv.glmnet(x[traindf, ], y[traindf],
                         lambda = grid, alpha = 0, nfolds = 10)
plot(cv.ridge.out, main = "Ridge Regression\n")

bestlambda.ridge = cv.ridge.out$lambda.min
#bestlambda.ridge # 0.00001, the best lambda for our ridge regression
#log(bestlambda.ridge) # -11.51293, log of the best lambda

ridge.bestlambdatrain = predict(cv.ridge.out, s = "lambda.min", newx = x[testdf, ])
#mean((ridge.bestlambdatrain - y.test)^2) #MSE = 0.0000156

#What is the test RMSE associated with this best value of lambda?
#sqrt(mean((ridge.bestlambdatrain - y.test)^2)) #RMSE = .003

What is the accuracy of our model after 10-fold cross validation?

compare <- cbind(actual = y.test, ridge.bestlambdatrain)
compare <- data.frame(compare)
compare <- compare %>% rename(predicted = X1)

#Undo Boxcox transformation
invBoxCox <- function(x, lambda){ 
    if (lambda == 0) exp(x) else (lambda*x + 1)^(1/lambda) }

compare_norm = data.frame(mapply(invBoxCox,compare,lambda))
#mean(apply(compare_norm, 1, min)/apply(compare_norm, 1, max))
Metrics::rmsle(compare_norm[,1], compare_norm[,2])
## [1] 0.262

Density plot showing the distribution of actual and predicted car values:

library(highcharter)

hchart(density(compare_norm$actual), type = "area",
       color = "#B71C1C", name = "actual price") %>%
  hc_add_series(density(compare_norm$predicted), area = TRUE, name = "predicted") %>%
  hc_add_theme(hc_theme_db()) %>%
  hc_title(text = "Actual Car Price vs. Predicted")

Applying our model to the test set…

quant_dftest_bc = mapply(bc, quantdftest, lambda)

#convert quant_dftest_bc to a dataframe
quant_dftest_bc = data.frame(quant_dftest_bc)

quant_dftest_bc = na.omit(quant_dftest_bc)


library(lmridge)
#lmridge takes its ridge constant as K; use the best lambda found by cross-validation
#(the bare `lambda` variable still holds the Box-Cox lambda, which is not a ridge penalty;
#glmnet and lmridge scale their penalties differently, but both values are effectively ~0 here)
ridge.final = lmridge(price ~ ., data = quant_df_bc, K = bestlambda.ridge)
test_prediction = data.frame(predicted = predict(ridge.final, newdata = quant_dftest_bc))
compare_test = data.frame(mapply(invBoxCox, test_prediction, lambda))

actual = data.frame(actual = quantdftest$price)

myPrediction = cbind(actual, compare_test)

Root mean squared log error (RMSLE):

#Accuracy:
#mean(apply(myPrediction, 1, min)/apply(myPrediction, 1, max))
library(Metrics)
Metrics::rmsle(myPrediction$actual, myPrediction$predicted)
## [1] 0.263