When you purchase a brand new car, the manufacturer’s suggested retail price (MSRP) is predetermined. But when you sell your car and ownership of the vehicle changes hands, how do you determine what it is worth? Most likely you go to Kelley Blue Book or CarFax to get an appraisal. Many factors contribute to a car’s value, from mileage to engine type. In this project I will show you how to estimate a car’s worth using data gathered from scratch! The workflow looks like this:
For any data science enthusiast, web scraping is a valuable skill. Most of the time data scientists work with prepared data sets, supplied by employers, clients, or data set hosts like Kaggle. But for the ever-curious data scientist, collecting unstructured data can be a worthwhile venture. Scraping the web for data can be tricky, but Scrapy, a Python-based web scraping framework, makes the process much easier. Here I will show you how to scrape KBB.com for used car listings.
library(reticulate)
Scrapy allows you to program a bot (i.e., a Python script) that collects whatever data you want from a webpage. Although I won’t go into depth about all the code a Scrapy bot requires, I will outline the key points of Scrapy bot design. First, start a new project from the command line:
scrapy startproject <Name of Project>
This will create a Scrapy project directory with all the files you need to start web scraping. Next, in items.py, define the fields you want to collect:

import scrapy
class KbbItem(scrapy.Item):
    car = scrapy.Field()
    mileage = scrapy.Field()
    price = scrapy.Field()
    color = scrapy.Field()
    transmission = scrapy.Field()
    engine = scrapy.Field()
    doors = scrapy.Field()
    body = scrapy.Field()
    mpg = scrapy.Field()
    drive_type = scrapy.Field()
    kbb_expert_review = scrapy.Field()
    consumer_review = scrapy.Field()
    location = scrapy.Field()
    website = scrapy.Field()
Next comes the spider itself, which lives in the project’s spiders/ directory:

from scrapy import Spider, Request
from kbb.items import KbbItem
import re
class KbbSpider(Spider):
    #name: an attribute specifying a unique name to identify the spider
    name = "kbb_spider"
    #allowed_domains: the main domain of the website you want to scrape
    allowed_domains = ['www.kbb.com']
    #start_urls: an attribute listing the URLs the spider will start from
    start_urls = ['https://www.kbb.com/cars-for-sale/cars/used-cars/?p=1&color=']
    #parse(): a method of the spider responsible for processing a Response object
    #downloaded from the URL and returning scraped data (as well as more URLs to follow, if necessary)
    def parse(self, response):
        text = response.xpath('//*[@id="listingsContainer"]//span[@class="page-numbers"]/text()').extract()[1]
        colors = ['beige', 'black', 'blue', 'brown', 'burgundy', 'charcoal', 'gold',
                  'gray', 'green', 'offwhite', 'orange', 'pink', 'purple', 'red',
                  'silver', 'tan', 'turquoise', 'white', 'yellow']
        result_pages = ['https://www.kbb.com/cars-for-sale/cars/used-cars/?p={}&color={}'.format(x, y)
                        for x in range(1, int(text) + 1) for y in colors]
        for url in result_pages:
            yield Request(url=url, callback=self.parse_result_page)
    def parse_result_page(self, response):
        products = response.xpath('//*[@id="listingsContainer"]//a[@class="js-vehicle-name"]/@href').extract()
        product_urls = ['https://www.kbb.com' + x for x in products]
        for url in product_urls:
            yield Request(url=url, callback=self.parse_product_page)
    def parse_product_page(self, response):
        car = response.xpath('//*[@id="vehicleDetails"]/div/div/h1/text()').extract_first()
        price = response.xpath('//*[@id="pricingWrapper"]/span/span[@class="price"]/text()').extract_first()
        mileage = response.xpath('//*[@id="keyDetailsContent"]/div/div/div/ul/li[@class="js-mileage"]/text()').extract_first()
        color = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Exterior Color:")]/text()').extract_first()
        transmission = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Transmission:")]/text()').extract_first()
        engine = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Engine:")]/text()').extract_first()
        doors = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Doors:")]/text()').extract_first()
        body = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Body Style:")]/text()').extract_first()
        mpg = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Fuel Economy:")]/text()').extract_first()
        drive_type = response.xpath('//*[@id="keyDetailsContent"]//li[contains(text(),"Drive Type:")]/text()').extract_first()
        kbb_expert_review = response.xpath('//*[@id="readExpertReview"]/p/span[@class="title-three"]/text()').extract_first()
        consumer_review = response.xpath('//*[@id="readConsumerReview"]/p/span[@class="title-three"]/text()').extract_first()
        location = response.xpath('//*[@id="aboutTheDealer"]//p/text()').extract()[:2]
        website = response.xpath('//*[@id="dealerDetailsModuleWebsite"]/@href').extract_first()

        item = KbbItem()
        item['car'] = car
        item['price'] = price
        item['mileage'] = mileage
        item['color'] = color
        item['transmission'] = transmission
        item['engine'] = engine
        item['doors'] = doors
        item['body'] = body
        item['mpg'] = mpg
        item['drive_type'] = drive_type
        item['kbb_expert_review'] = kbb_expert_review
        item['consumer_review'] = consumer_review
        item['location'] = location
        item['website'] = website
        yield item
The pipelines in pipelines.py validate the scraped items and export them to a CSV file:

from scrapy.exceptions import DropItem
from scrapy.exporters import CsvItemExporter
class ValidateItemPipeline(object):
    def process_item(self, item, spider):
        if not all(item.values()):
            raise DropItem("Missing values!")
        else:
            return item
class KbbPipeline(object):
    def __init__(self):
        self.filename = 'kbb2.csv'

    def open_spider(self, spider):
        self.csvfile = open(self.filename, 'wb')
        self.exporter = CsvItemExporter(self.csvfile)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.csvfile.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Finally, a few key settings in settings.py: ROBOTSTXT_OBEY makes the bot respect the site’s robots.txt, DOWNLOAD_DELAY throttles requests so we don’t hammer the server, and ITEM_PIPELINES activates our CSV pipeline.

BOT_NAME = 'kbb'
SPIDER_MODULES = ['kbb.spiders']
NEWSPIDER_MODULE = 'kbb.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5
ITEM_PIPELINES = {
    'kbb.pipelines.KbbPipeline': 300,
}
That’s it! Now open a terminal in the root of your project and enter: scrapy crawl <name of spider>
The spider will begin to crawl the site, collect the items you specified, and save them to the CSV file configured in the pipeline (kbb2.csv) in the root of your project directory!
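As a side note, Scrapy can also export items without a custom pipeline via its built-in feed exporter; this is standard Scrapy functionality, though not what this project uses:

scrapy crawl kbb_spider -o kbb2.csv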
When you finish your scraping, chances are the data will not be ready for exploration. Here are some of the issues I had to address before I began my EDA.
Feature engineering involves creating new columns of data from existing features; for example, taking a column of U.S. states and making a new Region column based on each state’s location, so that Connecticut would have the value “Northeast” and Oregon “Northwest”.
I did not employ feature engineering for this project, but I may in the future. A minimal sketch of the idea is shown below.
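Purely for illustration (this is my own sketch, not code from the project; the data frame and column names are hypothetical), here is what that state-to-region mapping might look like in R:

library(dplyr)

#hypothetical data frame with a U.S. state abbreviation column
listings <- data.frame(state = c("CT", "OR", "TX"))

#derive a region column from each state's location
listings <- listings %>%
  mutate(region = case_when(
    state %in% c("CT", "MA", "NY") ~ "Northeast",
    state %in% c("OR", "WA")       ~ "Northwest",
    state %in% c("TX", "OK")       ~ "South",
    TRUE                           ~ "Other"
  ))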
A preview of the data set:
head(df) %>% kable() %>%
kable_styling() %>% scroll_box()
condition | year | brand | make_model | body | color | consumer_review | kbb_expert_review | mileage | city_mpg | highway_mpg | gas_combined_mpg | electric_combined_mpg | price | transmission | doors | engine | location | website |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Certified | 2015 | Kia | Kia Sedona SX | Van | Berry | 7.9 | 9.0 | 441 | 18 | 25 | 21 | NA | 28971 | 6-Speed Shiftable Automatic | Four Door | 6-Cylinder | 1100 East Golf road Schaumburg IL 60173 | https://www.schaumburgkia.com/?utm_source=KBB&utm_medium=website |
Used | 2007 | Volkswagen | Volkswagen Beetle Convertible w/ Triple White | Convertible | Campanella White | 6.9 | NA | 47771 | 19 | 28 | 22 | NA | 9975 | 6-Speed Automatic | Two Door | 5-Cylinder | 4203 Franklin Common Ct Franklin TN 37067 | http://www.sunvertibles.com/?utm_source=KBB&utm_medium=website |
Used | 2014 | Cadillac | Cadillac CTS AWD Sedan | Sedan | Majestic Plum Metallic | 9.2 | 8.0 | 32066 | 20 | 30 | 23 | NA | 19833 | 6-Speed Shiftable Automatic | Four Door | 4-Cylinder Turbo | 6100 Telegraph Road Toledo OH 43612 | http://www.groganstowne.com/?stocknum=K7912010&atc_ownerid=72868&utm_source=KBB&utm_medium=website |
Used | 1993 | Chevrolet | Chevrolet Corvette Coupe | Coupe | Green | 9.0 | NA | 67488 | 15 | 22 | 18 | NA | 5928 | Automatic | Two Door | 8-Cylinder | 2236 Highway 2565 Louisa KY 41230 | http://www.bigbluemotorslouisa.com/?utm_source=KBB&utm_medium=website |
Used | 2006 | Mazda | Mazda MAZDA5 Touring | Van | Grey | 7.9 | NA | 105194 | 19 | 24 | 21 | NA | 4999 | 4-Speed Automatic | Four Door | 4-Cylinder | 9815 Evergreen Way Everett WA 98204 | http://www.baysideautosales.net/?utm_source=KBB&utm_medium=website |
Used | 2015 | FIAT | FIAT 500 e Hatchback | Hatchback | Electric Orange Tri-Coat Pearlcoat | 9.0 | 7.6 | 20579 | NA | NA | NA | 116 | 11895 | Single-Speed | Two Door | Electric | 18225 Ponderosa Drive Parker CO 80134 | http://www.rauschmotorsllc.com/?utm_source=KBB&utm_medium=website |
And now the fun part: seeing what insights the data holds. One of my favorite R packages for EDA is XDA. Let’s use its numSummary() function to get statistical information on our numerical features:
library(xda)
numSummary(df) %>%kable() %>%
kable_styling() %>% scroll_box()
variable | n | mean | sd | max | min | range | nunique | nzeros | iqr | lowerbound | upperbound | noutlier | kurtosis | skewness | mode | miss | miss% | 1% | 5% | 25% | 50% | 75% | 95% | 99% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 17651 | 2011.63 | 4.99e+00 | 2018.0 | 1981.0 | 37.0 | 38 | 0 | 7.0 | 1997.5 | 2025.5 | 226 | 2.205 | -1.310 | 2015.0 | 0 | 0.000 | 1996.0 | 2002.0 | 2008.0 | 2013.0 | 2015.0 | 2017.0 | 2018.0 |
consumer_review | 17351 | 8.48 | 6.20e-01 | 10.0 | 1.0 | 9.0 | 40 | 0 | 0.8 | 6.9 | 10.1 | 61 | 4.929 | -1.001 | 8.6 | 300 | 1.700 | 6.9 | 7.3 | 8.1 | 8.6 | 8.9 | 9.3 | 9.6 |
kbb_expert_review | 10550 | 7.94 | 9.57e-01 | 9.8 | 3.3 | 6.5 | 54 | 0 | 1.4 | 5.2 | 10.8 | 90 | 0.908 | -0.464 | NA | 7101 | 40.230 | 5.3 | 6.4 | 7.3 | 7.9 | 8.7 | 9.4 | 9.7 |
mileage | 17637 | 55586.01 | 4.37e+04 | 364000.0 | 1.0 | 363999.0 | 16277 | 0 | 59891.5 | -67176.2 | 172387.2 | 303 | 1.487 | 1.174 | NA | 14 | 0.079 | 724.4 | 5699.6 | 22661.0 | 41639.0 | 82550.0 | 139977.6 | 185112.8 |
city_mpg | 16517 | 21.09 | 5.72e+00 | 58.0 | 8.0 | 50.0 | 45 | 0 | 8.0 | 5.0 | 37.0 | 224 | 4.258 | 1.347 | 17.0 | 1134 | 6.425 | 12.0 | 14.0 | 17.0 | 20.0 | 25.0 | 30.0 | 41.0 |
highway_mpg | 16526 | 28.82 | 6.49e+00 | 53.0 | 13.0 | 40.0 | 38 | 0 | 10.0 | 9.0 | 49.0 | 6 | -0.696 | 0.159 | NA | 1125 | 6.374 | 17.0 | 19.0 | 24.0 | 28.0 | 34.0 | 39.0 | 42.0 |
gas_combined_mpg | 16604 | 23.95 | 5.91e+00 | 56.0 | 10.0 | 46.0 | 41 | 0 | 9.0 | 5.5 | 41.5 | 157 | 1.171 | 0.762 | 20.0 | 1047 | 5.932 | 14.0 | 16.0 | 19.0 | 23.0 | 28.0 | 33.0 | 41.0 |
electric_combined_mpg | 204 | 108.53 | 1.35e+01 | 133.0 | 46.0 | 87.0 | 25 | 0 | 11.0 | 88.5 | 132.5 | 19 | 6.916 | -2.416 | NA | 17447 | 98.844 | 53.0 | 88.0 | 105.0 | 114.0 | 116.0 | 119.0 | 119.0 |
price | 17619 | 17100.66 | 1.50e+04 | 482000.0 | 500.0 | 481500.0 | 6578 | 0 | 10482.2 | -5728.4 | 36200.9 | 988 | 181.155 | 9.158 | 12995.0 | 32 | 0.181 | 2995.0 | 4995.0 | 9995.0 | 13426.0 | 20477.5 | 37761.3 | 64900.0 |
Column legend (taken from the XDA GitHub page):
n = total number of rows for that variable
nunique = number of unique values
nzeros = number of zeros
iqr = interquartile range
noutlier = number of outliers
miss = number of rows with missing values
miss% = percentage of total rows with missing values ((miss/n)*100)
5% = 5th percentile value of that variable (the value below which 5 percent of the observations fall); the other percentile columns are read the same way, and they are helpful in detecting outliers
This function is very useful for detecting the number of outliers in a feature: the lowerbound and upperbound columns appear to be the familiar 1.5 × IQR fences, and noutlier counts the observations falling outside them. The short sketch below recomputes those fences for price.
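To make the outlier columns concrete, here is a quick sketch (my own, not part of XDA) that recomputes the fences; results may differ slightly from numSummary() depending on the quantile method used:

#recompute the IQR fences numSummary reports as lowerbound/upperbound
q <- quantile(df$price, c(0.25, 0.75), na.rm = TRUE)
iqr <- unname(q[2] - q[1])
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

#count observations outside the fences (compare to the noutlier column)
sum(df$price < lower | df$price > upper, na.rm = TRUE)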
The charSummary() function, while less detailed, can provide some insight into our character-type features.
charSummary(df) %>% kable() %>%
kable_styling() %>% scroll_box()
variable | n | miss | miss% | unique | top5levels:count |
---|---|---|---|---|---|
condition | 17651 | 0 | 0 | 2 | Used:15781, Certified:1870 |
brand | 17651 | 0 | 0 | 55 | Ford:1846, Chevrolet:1838, Toyota:1515, Honda:1474, Nissan:1263 |
make_model | 17651 | 0 | 0 | 3128 | Honda Civic LX Sedan:215, Ford Focus SE Sedan:154, Chevrolet Malibu LT:120, Ford Fusion SE:114, Lexus ES 350 :112 |
body | 17651 | 0 | 0 | 8 | Sedan:6632, Sport Utility:5300, Hatchback:1560, Truck:1266, Coupe:1068 |
color | 17651 | 0 | 0 | 2304 | Charcoal:829, Black:459, Beige:434, Purple:413, Gold:381 |
transmission | 17651 | 0 | 0 | 37 | 6-Speed Shiftable Automatic:4799, Automatic:2489, Continuously Variable Automatic:2184, 4-Speed Automatic:1277, 5-Speed Automatic:1128 |
doors | 17651 | 0 | 0 | 5 | Four Door:14650, Two Door:2845, Three Door:106, :40, Five Door:10 |
engine | 17651 | 0 | 0 | 23 | 4-Cylinder:6528, 6-Cylinder:5342, 8-Cylinder:2116, 4-Cylinder Turbo:1889, 6-Cylinder Turbo:394 |
location | 17651 | 0 | 0 | 4413 | 4319 Rt.130 South Edgewater Park NJ 08010:89, 333 Broadway Avenue Bedford OH 44146:88, Nationwide shipping to San Francisco CA 94109:85, :79, 21151 NW 2ND AVE MIAMI FL 33169:78 |
website | 17651 | 0 | 0 | 10168 | :275, http://www.floridafinecars.com/?utm_source=KBB&utm_medium=website:111, https://www.carvana.com/?utm_source=autotrader&utm_medium=referral&utm_campaign=autotrader-referral-ATL&utm_content=vdp-visitwebsite:110, http://www.autodirectcars.com/?utm_source=KBB&utm_medium=website:89, http://www.ncautoauction.com/?utm_source=autotrader&utm_medium=referral&utm_content=visit-dealer-site&utm_campaign=big-orng-btn:88 |
Here we see that the Honda Civic LX Sedan is the most frequent make and model being sold. “Charcoal” is the most frequent car color, “6-Speed Shiftable Automatic” is the most frequent transmission, and so on.
Now that we have collected about 17,000 used car details from KBB.com, let’s see what brands dominate used car listings.
Of all the cars I collected (your results may vary), the most frequent brands were Ford, Chevrolet, Toyota, and Honda. Sedans and sport utility vehicles are very frequent in used car listings. Let’s take a look at how ratings between car owners and Kelley Blue Book experts differ.
Here we can see that consumers are a little more lenient in how they rate a car and will give higher marks, while experts tend to be more critical.
Most cars are priced between $11k and $13k. There is a substantial drop-off in cars valued over $15k. Some luxury vehicles selling for over $450,000 are outliers and will need to be removed for predictive modeling. (Try zooming in.)
Most cars have mileage between 15,000 and 30,000 miles.
Here we see something interesting: most cars being sold were built in 2015, making them only three years old. Why is this? According to the Edmunds Used Car Report, “24% of franchise used sales were 3 years old in Q1 2018”. Once a three-year lease is up, many owners look to sell their car, saturating the market with three-year-old cars.
In this next section we will train a multiple linear regression model with ridge regularization to predict the price of used cars from data the model has never seen.
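As a quick refresher (standard notation, not from the original analysis), ridge regression minimizes the least-squares loss plus an L2 penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

The tuning parameter $\lambda \ge 0$ controls how strongly coefficients are shrunk toward zero; we will choose it by cross-validation with cv.glmnet() below.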
library(gradDescent)
library(data.table)
library(dplyr)
library(ISLR)
library(glmnet)
library(MASS)
library(car)
df = fread('C:/Users/pache_000/Desktop/kbb/KBB_used_final_3.csv')
#80/20 Train-Test split
splitDF = splitData(df, 0.8, seed = 1)
train = splitDF$dataTrain
test = splitDF$dataTest
For simplicity’s sake, I will only use the quantitative features in this demonstration.
quantdf = select_if(train, is.numeric)
quantdftest = select_if(test, is.numeric)
quantdf$electric_combined_mpg = NULL
quantdftest$electric_combined_mpg = NULL
quantdf = na.omit(quantdf)
quantdftest = na.omit(quantdftest)
#initial model
model = lm(price ~ . , data = quantdf)
Removing outliers:
quantdf = quantdf[-c(2359, 6371, 3701, 2526, 2916),]
quantdf = quantdf[-c(1933, 274, 6731),]
quantdf = quantdf[-c(3124, 3708, 2005, 5534),]
Applying a Box-Cox transformation to our model:
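For reference, the Box-Cox transform (standard definition) and its inverse, which we will need later to convert predictions back to dollars, are:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda} & \lambda \neq 0 \\ \log y & \lambda = 0 \end{cases} \qquad\qquad y = \big(\lambda\, y^{(\lambda)} + 1\big)^{1/\lambda} \quad (\lambda \neq 0)$$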
model.transformed = boxcox(model, lambda = seq(-2, 2, 1/10), plotit = TRUE,
interp = T, eps = 1/50, xlab = expression(lambda),
ylab = "log-Likelihood")
#Extracting best lambda by finding the maximum log-likelihood.
lambda = model.transformed$x[which(model.transformed$y == max(model.transformed$y))]
#defining a boxcox function
bc <- function (obs, lambda) {(obs^lambda-1)/lambda }
#Use mapply() to apply the bc function to our dataset using our best lambda
quant_df_bc = mapply(bc,quantdf,lambda)
#convert quant_df_bc to dataframe
quant_df_bc = data.frame(quant_df_bc)
#new multilinear model that uses our Box-Cox transformed dataframe
model.bc = lm(price ~ ., data = quant_df_bc)
plot(model.bc)
summary(model.bc)
##
## Call:
## lm(formula = price ~ ., data = quant_df_bc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.021862 -0.002376 -0.000045 0.002434 0.015795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.90e+02 3.45e+00 -55.09 < 2e-16 ***
## year 8.49e+01 1.52e+00 55.79 < 2e-16 ***
## consumer_review 2.76e-02 1.50e-03 18.48 < 2e-16 ***
## kbb_expert_review 2.25e-02 9.09e-04 24.72 < 2e-16 ***
## mileage -1.68e-03 5.73e-04 -2.93 0.0034 **
## city_mpg 3.48e-02 6.32e-03 5.50 3.9e-08 ***
## highway_mpg -4.10e-02 4.89e-03 -8.38 < 2e-16 ***
## gas_combined_mpg -8.11e-02 1.03e-02 -7.90 3.2e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0039 on 8022 degrees of freedom
## Multiple R-squared: 0.655, Adjusted R-squared: 0.655
## F-statistic: 2.18e+03 on 7 and 8022 DF, p-value: <2e-16
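An adjusted R-squared of 0.655 tells us the quantitative features alone explain about two-thirds of the variance in the (Box-Cox transformed) price.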
vif(model.bc)
## year consumer_review kbb_expert_review mileage
## 1.30 1.05 1.27 1.06
## city_mpg highway_mpg gas_combined_mpg
## 81.77 29.93 173.34
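Note the very large VIF values for city_mpg, highway_mpg, and gas_combined_mpg: the three fuel-economy features are highly collinear (well above the common rule-of-thumb cutoff of 10), which is precisely the situation ridge regularization is designed to handle.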
coef(model.bc)
## (Intercept) year consumer_review kbb_expert_review
## -1.90e+02 8.49e+01 2.76e-02 2.25e-02
## mileage city_mpg highway_mpg gas_combined_mpg
## -1.68e-03 3.48e-02 -4.10e-02 -8.11e-02
x = model.matrix(price ~ ., data = quant_df_bc)[, -1] #Dropping the intercept column.
y = quant_df_bc$price
#define grid for our ridge regression
grid = 10^seq(3, -5, length = 100)
#create our ridge model using x, y, and grid.
ridge.models = glmnet(x, y, alpha = 0, lambda = grid)
plot(ridge.models, xvar = "lambda", label = TRUE, main = "Ridge Regression")
set.seed(0)
traindf = sample(1:nrow(x), 8*nrow(x)/10)
testdf = (-traindf)
y.test = y[testdf]
#Double check 80/20 train/test split
#length(traindf)/nrow(x)
#length(y.test)/nrow(x)
Running 10-fold cross-validation:
set.seed(0)
cv.ridge.out = cv.glmnet(x[traindf, ], y[traindf],
lambda = grid, alpha = 0, nfolds = 10)
plot(cv.ridge.out, main = "Ridge Regression\n")
bestlambda.ridge = cv.ridge.out$lambda.min
#bestlambda.ridge # 0.00001, the best lambda for our ridge regression
#log(bestlambda.ridge) # -11.51293, the log of the best lambda
ridge.bestlambdatrain = predict(cv.ridge.out, s = "lambda.min", newx = x[testdf, ])
#mean((ridge.bestlambdatrain - y.test)^2) #MSE = 0.0000156
#What is the test RMSE associated with this best value of lambda?
#sqrt(mean((ridge.bestlambdatrain - y.test)^2)) #RMSE = .003
How accurate is our model after 10-fold cross-validation? To find out, we compare predictions on the held-out data with the actual prices, after undoing the Box-Cox transformation:
compare <- cbind(actual = y.test, ridge.bestlambdatrain)
compare <- data.frame(compare)
compare = compare %>% rename(., predicted = X1)
#Undo Boxcox transformation
invBoxCox <- function(x, lambda){
if (lambda == 0) exp(x) else (lambda*x + 1)^(1/lambda) }
compare_norm = data.frame(mapply(invBoxCox,compare,lambda))
#mean(apply(compare_norm, 1, min)/apply(compare_norm, 1, max))
Metrics::rmsle(compare_norm[,1], compare_norm[,2])
## [1] 0.262
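Because RMSLE operates on a log scale, a value of about 0.26 means a typical prediction is off by a factor of roughly $e^{0.26} \approx 1.3$, i.e. within about 30% of the actual price.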
library(highcharter)
hchart(density(compare_norm$actual), type = "area",
       color = "#B71C1C", name = "actual price") %>%
  hc_add_series(density(compare_norm$predicted), area = TRUE, name = "predicted") %>%
  hc_add_theme(hc_theme_db()) %>%
  hc_title(text = "Actual Car Price vs. Predicted")
quant_dftest_bc = mapply(bc, quantdftest, lambda)
#convert quant_dftest_bc to a dataframe and drop rows with missing values
quant_dftest_bc = data.frame(quant_dftest_bc)
quant_dftest_bc = na.omit(quant_dftest_bc)
library(lmridge)
ridge.final = lmridge(price ~ ., data = quant_df_bc, lambda = lambda)
#predict on the held-out test set, then undo the Box-Cox transformation
test_prediction = data.frame(predicted = predict(ridge.final, newdata = quant_dftest_bc))
compare_test = data.frame(mapply(invBoxCox, test_prediction, lambda))
actual_price = data.frame(actual = quantdftest$price)
myPrediction = cbind(actual_price, compare_test)
Root mean squared log error (RMSLE):
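For reference, Metrics::rmsle() computes

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\log(p_i+1) - \log(a_i+1)\big)^2}$$

where $p_i$ is the predicted price and $a_i$ the actual price.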
#Accuracy:
#mean(apply(myPrediction, 1, min)/apply(myPrediction, 1, max))
library(Metrics)
Metrics::rmsle(myPrediction$actual, myPrediction$predicted)
## [1] 0.263