Have you ever had a creative project in mind but lacked the funds to bring it to life?

Maybe you want to publish a cookbook, film a documentary, or release an album. With the growing popularity of crowd-funding sites, that idea doesn’t have to be a daydream. Kickstarter has grown into one of the most popular crowd-funding platforms on the web since its inception in 2009. Over 140,000 projects have been funded thanks to the patronage of millions of Kickstarter backers. From smart watches to card games about combustible felines, Kickstarter has actualized projects that have impacted technology, design, and pop culture.

So you have your idea, and you’re ready to launch your Kickstarter page…

Not to deter you, but only about 36% of projects get funded. So what is the magic formula to get your project fully funded? What factors affect whether a project succeeds or fails? This data set, provided by Kaggle, can give us insight into Kickstarter projects launched between 2009 and 2018…

Preliminary data cleaning and feature engineering

library(dplyr)
library(data.table)
library(kableExtra)
library(e1071)
library(MLmetrics)

kickstarter <- fread('kickstarter.csv', header=T)
## Read 378661 rows and 15 (of 15) columns from 0.054 GB file in 00:00:06
kickstarter$ID <- as.character(kickstarter$ID)
# remove 7 observations that have incorrect launch dates (year says "1970")
kickstarter = kickstarter[c(-2843, -48148, -75398, -94580, -247914, -273780, -319003),]
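Alternatively, since hard-coded row indices can shift if the CSV ever changes, the same rows can be dropped by filtering on the launch year directly. A minimal sketch (it assumes the launched column is still the raw character datetime at this point):

# alternative: drop any row whose launch year is the placeholder "1970"
kickstarter <- kickstarter[substr(launched, 1, 4) != "1970", ]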

Feature Engineering:

Does the season affect project success or failure? In my EDA of the Kickstarter data set in my R Shiny app (see the calendar graph), I noticed that more projects tend to be launched in the summer months (June-September). Maybe the winter holidays divert attention away from Kickstarter projects, or people are less likely to back a project when they have to spend their money on holiday gifts.

Let’s add a column that records the season in which each project was launched.

Does the duration of a project determine its success or failure? Very short campaigns might not have a chance to generate enough buzz, while long campaigns might deter backers because the deadline is nowhere in sight.

Let’s also add a column for the duration of each project (in days).

# convert deadline values to date type
kickstarter$deadline <- as.Date(kickstarter$deadline, "%Y-%m-%d")
# convert launched values to date type
kickstarter$launched <- as.Date(kickstarter$launched, '%Y-%m-%d %H:%M:%S')
# add a new column for project duration (a difftime in days)
kickstarter$project_duration_days <- kickstarter$deadline - kickstarter$launched


#write a function to extract the season from a datetime object
getSeason <- function(DATES) {
    WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # approximate winter solstice
    SE <- as.Date("2012-3-15",  format = "%Y-%m-%d") # approximate spring equinox
    SS <- as.Date("2012-6-15",  format = "%Y-%m-%d") # approximate summer solstice
    FE <- as.Date("2012-9-15",  format = "%Y-%m-%d") # approximate fall equinox

    # Convert dates from any year to 2012 dates
    d <- as.Date(strftime(DATES, format="2012-%m-%d"))

    ifelse (d >= WS | d < SE, "Winter",
      ifelse (d >= SE & d < SS, "Spring",
        ifelse (d >= SS & d < FE, "Summer", "Fall")))
}
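
As a quick sanity check (these are made-up dates, not rows from the data set), the function maps a handful of dates to the expected seasons:

# hypothetical dates to spot-check getSeason
getSeason(as.Date(c("2016-01-10", "2016-04-20", "2016-07-04", "2016-10-31")))
# expected: "Winter" "Spring" "Summer" "Fall"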

kickstarter = kickstarter %>% mutate(., season_launched = getSeason(launched))

A sample of our dataset

head(kickstarter) %>% kable() %>%
  kable_styling() %>% scroll_box()
| ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real | project_duration_days | season_launched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000 | 2015-08-11 | 0 | failed | 0 | GB | 0 | 0 | 1533.95 | 59 days | Summer |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000 | 2017-09-02 | 2421 | failed | 15 | US | 100 | 2421 | 30000.00 | 60 days | Summer |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000 | 2013-01-12 | 220 | failed | 3 | US | 220 | 220 | 45000.00 | 45 days | Winter |
| 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000 | 2012-03-17 | 1 | failed | 1 | US | 1 | 1 | 5000.00 | 30 days | Spring |
| 1000011046 | Community Film Project: The Art of Neighborhood Filmmaking | Film & Video | Film & Video | USD | 2015-08-29 | 19500 | 2015-07-04 | 1283 | canceled | 14 | US | 1283 | 1283 | 19500.00 | 56 days | Summer |
| 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 50000 | 2016-02-26 | 52375 | successful | 224 | US | 52375 | 52375 | 50000.00 | 35 days | Winter |

Using a support vector machine to predict whether a Kickstarter project will succeed or fail

Is there a way to predict the fate of a Kickstarter project before it is launched? It turns out there is. By training a predictive machine learning model, we can apply that model to new data to determine whether a project will succeed or fail. In the next steps, I train a support vector machine (SVM) model with a radial kernel on a randomized subset of the data to handle this task. I trained the model using the features I thought had the most direct impact on a project's fate: the main category, the subcategory, the project goal (in USD), the country where the project is based, the duration of the project in days, and the season in which the project was launched.

#select the features we believe have an impact on project success or failure
subset1 = kickstarter %>% select(state,  main_category, category, usd_goal_real, country, project_duration_days, season_launched)

#filter the state variable to just successful or failed projects (other states exist, like live, canceled, etc.)
subset1 = subset1 %>% filter(., state=="successful" | state =="failed")

#Factorize categorical variables:
subset1$state = as.factor(subset1$state)
subset1$category = as.factor(subset1$category)
subset1$main_category = as.factor(subset1$main_category)
subset1$country = as.factor(subset1$country)
subset1$season_launched = as.factor(subset1$season_launched)

subset1$project_duration_days = as.numeric(subset1$project_duration_days)

#omit NA values
subset1 = na.omit(subset1)

#Binarize the state variable (1 = successful, 0 = failed)
subset1$state <- ifelse(subset1$state=="successful", 1, 0)
#draw a random sample of 35,000 observations (without replacement) from the dataset
subset_sample = subset1[sample(nrow(subset1), 35000),]

set.seed(15)
index <- 1:nrow(subset_sample)
testindex <- sample(index, trunc(length(index)/3))

testset <- subset_sample[testindex,]
trainset <- subset_sample[-testindex,]

svm.model <- svm(state ~ . , data = trainset, cost = 75, gamma = 1)

#Test the model on the test set
#(because state is numeric 0/1, svm() fits a regression, so predict() returns continuous scores)
svm.pred <- predict(svm.model, testset[,-1])



#generate a data frame of predicted vs. actual values
x = data.frame(svm.pred)
y = data.frame(testset[,1])

prediction = cbind(x, y)

Finding ROC AUC

In classification problems, it is often better to evaluate predictions with scores or probabilities rather than hard labels. For example, instead of simply predicting whether a project succeeds or fails, it is better practice to estimate how likely each project is to succeed. An ROC curve, and the area under it (AUC), helps us evaluate the model this way.

Let’s first look at the area under the curve (AUC) and its confidence interval:

######pROC####
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_obj <- roc(prediction$testset...1., prediction$svm.pred)
auc(roc_obj)
## Area under the curve: 0.6332
ci(roc_obj)
## 95% CI: 0.6229-0.6435 (DeLong)
plot(roc_obj, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2), grid.col=c("green", "red"),
     max.auc.polygon=TRUE, auc.polygon.col="skyblue", print.thres=FALSE)
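
As a cross-check, we can compute the same AUC with the MLmetrics package loaded at the top; a minimal sketch, reusing the column names of the prediction data frame built above:

# AUC via MLmetrics, treating the continuous SVM scores as predicted probabilities
AUC(y_pred = prediction$svm.pred, y_true = prediction$testset...1.)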

Our SVM model has an AUC of about 0.63. An AUC greater than 0.5 tells us the model is predicting better than a random guess; however, 0.63 leaves plenty of room for improvement.
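
If we do want hard success/failure labels rather than scores, one simple option (a sketch, using an arbitrary 0.5 cutoff that could itself be tuned) is to threshold the predictions and tabulate them against the actual outcomes:

# threshold the continuous SVM scores at 0.5 and compare to the true 0/1 labels
pred_class <- ifelse(prediction$svm.pred >= 0.5, 1, 0)
table(predicted = pred_class, actual = prediction$testset...1.)
mean(pred_class == prediction$testset...1.)  # simple accuracy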

Future steps

To improve the model's accuracy, we can:

  1. Adjust the SVM hyper-parameters, cost and gamma (see the sketch after this list)
  2. Employ a model ensemble
  3. Try a different classification model (e.g. random forest)
  4. Do more feature engineering to uncover hidden features that impact project outcomes
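
For point 1, e1071 ships with a cross-validated grid-search helper; a minimal sketch (the cost and gamma values below are arbitrary starting points, not recommendations):

# grid search over cost and gamma with cross-validation (e1071::tune)
tuned <- tune(svm, state ~ ., data = trainset,
              ranges = list(cost = c(1, 10, 75, 100), gamma = c(0.01, 0.1, 1)))
summary(tuned)
tuned$best.parameters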