For the advanced portion of our project, we wanted to build a predictive model of the wildfire data. More specifically, we wanted to see how well we could classify fire size given a set of features. Because the dataset is large and already provides the labels (FIRE_SIZE_CLASS), supervised learning was the way to go. The heavy lifting lay in cleaning the data, extracting features, and splitting the dataset into training and testing sets. We used Pandas, scikit-learn, and NumPy to complete this section of the project. You can view the details of how we created the classifier in the source code at /wildfires/


Many columns were extraneous for the purposes of classification, so the only columns we used were STAT_CAUSE_DESCR, STAT_CAUSE_CODE, FIRE_SIZE_CLASS, FIRE_SIZE, and STATE. Rows missing any of these columns were removed from the training and testing sets because we wanted to work with data that was as complete as possible. Dropping them also avoided introducing further skew: the alternative would have been to impute the missing values, which doesn't really make sense without deeper knowledge about wildfires in the U.S. Finally, classification was performed separately on each year of the dataset.
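The cleaning step described above can be sketched with Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the toy rows are made up, and the FIRE_YEAR column name used to separate the data by year is an assumption.

```python
import pandas as pd

# Columns named in the text above.
COLUMNS = ["STAT_CAUSE_DESCR", "STAT_CAUSE_CODE", "FIRE_SIZE_CLASS",
           "FIRE_SIZE", "STATE"]

# Toy rows standing in for the real dataset (all values are made up).
# FIRE_YEAR is an assumed column name for the year of each fire.
fires = pd.DataFrame({
    "FIRE_YEAR": [1995, 1995, 1996, 1996],
    "STAT_CAUSE_DESCR": ["Lightning", "Arson", None, "Campfire"],
    "STAT_CAUSE_CODE": [1.0, 7.0, 9.0, 4.0],
    "FIRE_SIZE_CLASS": ["A", "B", "A", "C"],
    "FIRE_SIZE": [0.1, 2.0, 0.2, None],
    "STATE": ["CA", "TX", "NY", "OH"],
})

# Drop every row that is missing a value in any column of interest,
# rather than imputing the missing values.
clean = fires.dropna(subset=COLUMNS)

# Separate the cleaned rows by year: one DataFrame per year.
by_year = {year: group[COLUMNS] for year, group in clean.groupby("FIRE_YEAR")}
```

In this toy example the two 1996 rows each have a missing value, so only the two 1995 rows survive the cleaning.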

Extracting Features

Features used: STAT_CAUSE_CODE, season, region

Before training, we first had to extract features from the original dataset. Some columns were used as is (like STAT_CAUSE_CODE), and others had to be derived from existing data (like season and region).

There was a bit of experimentation with different features before settling on the above set. The reason for using STAT_CAUSE_CODE is fairly self-explanatory. With season, we figured that there might be more fires, and perhaps bigger fires, during the summer due to dryness and hotter temperatures in certain areas like California. region was chosen because different regions have different environments, so different factors might affect a fire.
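To make the derived features concrete, here is one way season and region could be computed. The helper names, the month-to-season boundaries, and the state-to-region grouping (a deliberately partial list based on U.S. Census regions) are all assumptions for illustration; the project may have defined them differently.

```python
# Hypothetical helpers: the season boundaries and region grouping below are
# assumptions, not the project's actual definitions.

def season_from_month(month: int) -> str:
    """Map a calendar month (1-12) to a meteorological season."""
    if month not in range(1, 13):
        raise ValueError(f"month must be 1-12, got {month}")
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


# Deliberately partial mapping (U.S. Census regions) to keep the sketch short.
STATE_TO_REGION = {
    "CA": "West", "OR": "West",
    "TX": "South", "FL": "South",
    "OH": "Midwest", "MN": "Midwest",
    "NY": "Northeast", "ME": "Northeast",
}


def region_from_state(state: str) -> str:
    """Look up a state's region, falling back to 'Unknown' if unmapped."""
    return STATE_TO_REGION.get(state, "Unknown")


def extract_features(cause_code: int, month: int, state: str) -> dict:
    """Build the three features named above for a single fire."""
    return {
        "STAT_CAUSE_CODE": cause_code,
        "season": season_from_month(month),
        "region": region_from_state(state),
    }
```

For example, `extract_features(1, 7, "CA")` yields a feature dictionary with season "summer" and region "West".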


Once feature extraction was finished, we split the feature vectors and labels into training and testing sets using an 80-20 split. We then fit a Multinomial Naive Bayes classifier (chosen because the labels are not binary) on the training set.
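A minimal sketch of the split-and-fit step with scikit-learn might look like the following. The data is synthetic, and one-hot encoding the categorical features before fitting is an assumption about preprocessing rather than something the write-up confirms.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic categorical features standing in for
# (STAT_CAUSE_CODE, season, region), encoded as small integers.
X_raw = np.column_stack([
    rng.integers(1, 14, size=n),  # cause code
    rng.integers(0, 4, size=n),   # season index
    rng.integers(0, 4, size=n),   # region index
])
# Synthetic, deliberately skewed labels.
y = rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.4, 0.1])

# 80-20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=0
)

# Assumption: one-hot encode the categories so each feature value becomes a
# 0/1 count. Categories unseen during fitting are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
clf = MultinomialNB()
clf.fit(encoder.fit_transform(X_train), y_train)

y_pred = clf.predict(encoder.transform(X_test))
accuracy = clf.score(encoder.transform(X_test), y_test)
```

The encoder is fit on the training rows only, so no information from the test set leaks into the model.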

The classifier did not perform as well as we would have liked. For example, for 1995 the classifier's accuracy (calculated as the number of correctly predicted labels over the total number of predictions) was 54.34%. Looking further into the confusion matrix, we found that the classifier was predicting only two size classes, A and B. This is most likely because there are so many more small fires than large ones. As we learned in class, Naive Bayes does not work well with skewed data.
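The accuracy calculation and the confusion-matrix check described above can be illustrated in plain Python. The labels below are made up, and the project itself presumably relied on scikit-learn for these metrics; this is only a sketch of the idea.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Number of correctly predicted labels over the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels: the true classes include C and D, but the toy
# "classifier" only ever predicts A or B.
y_true = ["A", "A", "A", "B", "B", "C", "D"]
y_pred = ["A", "A", "B", "B", "A", "B", "A"]

acc = accuracy(y_true, y_pred)             # 3 correct out of 7
matrix = confusion_counts(y_true, y_pred)
predicted_classes = sorted(set(y_pred))    # which classes were ever predicted
```

Inspecting `predicted_classes` (or the columns of the confusion matrix) is what reveals that only A and B are ever predicted, something the accuracy figure alone hides.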

From this evaluation, we experimented with different features and labels. For example, instead of using the exact cause of the fire, we separated the causes into two buckets: natural vs. human. Unfortunately, this performed roughly as well as the original features.
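The natural vs. human bucketing could be sketched as below. Which causes count as natural is an assumption here: only "Lightning" is treated as natural, and everything else (including any miscellaneous or undefined cause) falls into the human bucket.

```python
# Assumption: lightning is the only cause treated as natural; the project's
# actual split of causes may differ.
NATURAL_CAUSES = {"Lightning"}


def cause_bucket(cause_descr: str) -> str:
    """Collapse a STAT_CAUSE_DESCR value into 'natural' or 'human'."""
    return "natural" if cause_descr in NATURAL_CAUSES else "human"


causes = ["Lightning", "Arson", "Campfire", "Lightning"]
buckets = [cause_bucket(c) for c in causes]
```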

Interestingly, we were able to get seemingly good classification by handling the labels differently while using the original features. Instead of directly using the size classes, the sizes can be bucketed into small (A through C), medium (D through F), and large (G). Doing so yielded "good" results - the classifier predicted with more than 90% accuracy. But looking at the confusion matrix to see where the predictions were right and wrong, all the predictions ended up in the "small" bucket. Again, we suspect that this is due to heavy skew caused by the large number of small fires.
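The label bucketing above, and the reason a high accuracy can be misleading on skewed labels, can be shown with a small example. The class counts below are made up purely for illustration.

```python
# Bucketing from the text: A-C small, D-F medium, G large.
SIZE_BUCKETS = {
    "A": "small", "B": "small", "C": "small",
    "D": "medium", "E": "medium", "F": "medium",
    "G": "large",
}


def bucket_label(size_class: str) -> str:
    """Map a FIRE_SIZE_CLASS letter to its coarse bucket."""
    return SIZE_BUCKETS[size_class]


# Made-up, heavily skewed class counts (100 fires in total).
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 5 + ["D"] * 2 + ["E", "F", "G"]
bucketed = [bucket_label(c) for c in labels]

# A "classifier" that always answers "small" is right whenever the true
# bucket is small, so its accuracy equals the share of small fires.
baseline_accuracy = bucketed.count("small") / len(bucketed)
```

With these toy counts, always predicting "small" already scores 95%, which is why a confusion matrix is needed to tell a useful classifier from one that only predicts the majority bucket.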

Overall, there is definitely room to improve. Perhaps we could have trained on a more balanced sample of the dataset instead of training on the skewed data. Or maybe we could have chosen better features and grouped them into more fitting buckets. Because fire size is a fairly unusual characteristic to predict, and because the features describing a fire's context (i.e., date, cause, season, etc.) may in fact depend on one another, a Naive Bayes model, which assumes that the features are independent given the class, may not have been the best choice for our classifier.
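One possible reading of training on a less skewed sample is to downsample every class to the size of the rarest one. The sketch below is an assumption about how that could be done, not something the project tried; the helper name and toy rows are hypothetical.

```python
import random
from collections import defaultdict


def downsample(rows, label_key, seed=0):
    """Randomly keep the same number of rows per label (the rarest label's count)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    n = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n))
    return balanced


# Made-up rows: 8 class-A fires, 3 class-B fires, 2 class-G fires.
rows = (
    [{"FIRE_SIZE_CLASS": "A"} for _ in range(8)]
    + [{"FIRE_SIZE_CLASS": "B"} for _ in range(3)]
    + [{"FIRE_SIZE_CLASS": "G"} for _ in range(2)]
)
balanced = downsample(rows, "FIRE_SIZE_CLASS")
```

The trade-off is that downsampling discards most of the small-fire rows, so the balanced training set can be far smaller than the original.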

Results from 1992 to 2015

Using these features: STAT_CAUSE_CODE, region, season

Year Classifier Accuracy (correct predictions / total predictions)
1992 0.5889536497363308
1993 0.5378188870654931
1994 0.5426583668985719
1995 0.5434930112634008
1996 0.5427273813792279
1997 0.5469655972168534
1998 0.5377805229565575
1999 0.5019501441410887
2000 0.5327040680670034
2001 0.5414752769521751
2002 0.5483597285067874
2003 0.5648555276381909
2004 0.5752069089600576
2005 0.5238738738738739
2006 0.5077815879715115
2007 0.5183937096321258
2008 0.518005540166205
2009 0.5418986008725741
2010 0.49703138252756573
2011 0.5185913325925293
2012 0.4716962621406848
2013 0.5426836002208725
2014 0.5307467057101025
2015 0.4898793119832747


Use the following form to predict a fire size class for a particular year.