For the advanced portion of our project, we wanted to create a predictive model that represented the wildfire data. More
specifically, we wanted to see how well we could classify fire size given a set of features. Because the datasets are
so large, supervised learning was the way to go - the dataset already provided the labels (
FIRE_SIZE_CLASS). The heavy lifting required lied in cleaning the data, extracting features, and splitting the dataset into appropriate
Scikit, and NumPy were the tools used to complete this section of the project. You can view the details of how we
created the classifier in the source code at /wildfires/wildfire_classifier.py
There were a lot of columns that were extraneous for the purposes of classification, so the only columns we used were:
STATE. Rows that had any of these columns missing were removed from the training/testing sets because we wanted to execute
on as complete data as possible. Doing this removed further skewing, as we would have had to impute missing data, which
doesn't really make sense without deeper knowledge about wildfires in the U.S. Finally, classification was performed
on the dataset separated by year.
Before training with features, we first had to extract features from the original dataset. Some columns were used as is (like
STAT_CAUSE_CODE) and others had to be defined from existing data, like
There was a bit of experimentation with different features before settling on the above set. The reason for using
STAT_CAUSE_CODE is fairly self-explanatory. With
season, we figured that there might be more fires and perhaps bigger fires during the summer due to dryness and hotter temperatures
in certain areas like California.
region was chosen because different regions have different environments, resulting in different factors that might go into
affecting a fire. Furthermore,
Once finished with feature extraction, we had to split the feature vectors and labels into training and testing sets. A Multinomial Naive Bayes (because labels are not binary) was fit using an 80-20 split.
The classifier did not perform as well as we would have liked it to. For example, in 1995, the classifier accuracy (calculated as number of correctly matched labels over total size) was 54.34%. After further investigation into the confusion matrix, the classifier was predicting only two class sizes, A and B. This is most likely due to the fact that there are so many more small fire sizes, compared to larger ones. As we learned in class, Naive Bayes does not work well with skewed data.
From this evaluation, we experimented with different features and labels. For example, instead of using the exact cause of the fire, we can separate a cause into two buckets: natural vs. human. Unfortunately, this performed roughly as well as the original evaluation.
Interestingly, we were able to get good classification by handling the labels differently, while using the original features. Instead of directly using the class sizes, the sizes can be bucketed into small (A through C), medium (D through F), and large (G). Doing so yielded "good" results - the classifier predicted with more than 90% accuracy. But, looking at the confusion matrix to see where the predictions were true and false, all the predictions ended up in the "small" bucket. Again, we predict that this is due to heavy skewing due to the number of small fires.
Overall, there is definitely room to improve. Perhaps we could have chosen a more normalized sample of the dataset to train on, instead of training on the skewed data. Or maybe we could have chosen better features and extracted features into more fitting buckets. Because predicting a fire size is a fairly unusual characteristic to "predict", and the fact that a fire's context (i.e, date, cause, season, etc.) might in fact be somewhat dependent on each other, a Naive Bayes model may not have been the best choice to use for our classifier.
Results from 1992 to 2015
Using these features: STAT_CAUSE_CODE, region, season
|Year||Classifier Accuracy (correct predictions / total predictions)|
Use the following form to predict a fire size class for a particular year.