Using the 100 Million Photos and Videos database from Flickr to predict travel patterns within the United States and Central America.
People typically travel to take photographs, or go to a specific place to take photographs. Even if it is their backyard, it is a place that has meaning and visual attraction. I am interested in looking at photography as a predictor of ideal locations to travel to. Where do people like to take photographs? Where will people like to take photographs?
Each record in the Flickr database contains the following fields:
- Photo/video ID
- User NSID, User nickname
- Date taken
- Date uploaded
- Capture device
- Title, Description
- User tags (comma-separated), Machine tags (comma-separated)
- Longitude, Latitude
- Accuracy
- Photo/video page URL, Photo/video download URL
- License name, License URL
- Photo/video server identifier, Photo/video farm identifier
- Photo/video secret, Photo/video secret original
- Photo/video extension original
- Photo/video marker (0 = photo, 1 = video)
Cleaning consisted of the following steps:
- Taking out any cameras with "scan" in the name
- Binning the rest of the camera brands, putting any that occur less than 1% of the time into a category "Other"
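The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the original cleaning code: the function name `clean_camera_brands`, the sample device strings, and the choice to treat the first word of the capture device as the brand are all assumptions made for the example.

```python
from collections import Counter

def clean_camera_brands(devices, min_share=0.01):
    """Drop scanner entries, then bin rare brands into "Other"."""
    # Step 1: remove empty entries and any device with "scan" in its name.
    kept = [d for d in devices if d.strip() and "scan" not in d.lower()]
    # Assumption: the brand is the first word of the capture device string.
    brands = [d.split()[0].title() for d in kept]
    # Step 2: brands below min_share of the remaining rows become "Other".
    counts = Counter(brands)
    total = len(brands)
    return [b if counts[b] / total >= min_share else "Other" for b in brands]

# Hypothetical sample: one rare brand and a few scanner entries.
devices = (
    ["Canon EOS 5D"] * 120
    + ["Apple iPhone 4"] * 79
    + ["Holga 120"] * 1
    + ["Epson Scanner"] * 5
)
cleaned = clean_camera_brands(devices)
brand_counts = Counter(cleaned)
```

The 1% threshold is applied after the scanner rows are removed, which is one reasonable reading of the steps; applying it to the full dataset instead would shift the cutoff slightly.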
Exploring the camera brands in the dataset shows a clear growth in Canon cameras over time, although the introduction of the Apple iPhone in 2007 quickly makes Apple a contender.
This analysis focused on the United States and Central America, and K-Means clustering was used to break the area up into regions. To choose the number of clusters, a silhouette score was computed for a range of cluster counts. Using the scores as a guideline, the final number of clusters selected was 15.
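The silhouette comparison might look something like the sketch below, using scikit-learn's `KMeans` and `silhouette_score`. The points are synthetic stand-ins for the (longitude, latitude) pairs, and the helper name `silhouette_by_k`, the range of cluster counts, and the use of plain Euclidean distance on raw degrees are assumptions for illustration, not details taken from the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight blobs of (longitude, latitude) points.
rng = np.random.default_rng(0)
centers = np.array([[-120.0, 45.0], [-100.0, 20.0], [-75.0, 40.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])

def silhouette_by_k(points, k_values):
    """Fit K-Means for each candidate k and record its silhouette score."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = model.fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return scores

scores = silhouette_by_k(points, range(2, 7))
# The highest score is only a guideline; the final choice can still be manual.
best_k = max(scores, key=scores.get)
```

On real photo coordinates the scores are rarely this clear-cut, which is why they serve as a guideline rather than a hard rule for the final cluster count.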
The points were grouped by cluster, and the groups were used to create the set of time series below, sorted by region. On average, the R-squared value was 86.2%, with a root mean square error of 11.9%, using a time slice of five years to predict each sixth year.
| Far West | West | Central | East |
|---|---|---|---|
| Alaska | Pacific Northwest | Northern Mountains | Northeast |
| Western Canada | California | Rocky Mountains | Mid-Atlantic |
| Hawaii | Southwest | Great Lakes | Southeast |
| [Central America] | [South] | [Caribbean] | |
Based on the analysis, the Pacific Northwest is forecast to remain the most popular region, a status it has held from 2000 onward. The least popular regions are forecast to be Hawaii and the South. Visits to Central America and to California are forecast to grow.
This analysis has been based on a simple K-Means clustering, with the number of clusters fine-tuned. The data were also sliced into a simple year-by-year time series and analyzed using linear regression.
Using a database of only Flickr photos introduces biases to the data and the prediction. For example, the relative popularity of Flickr peaked around 2010-2011 and has noticeably declined since. The rise of other photo-sharing services such as Instagram, Twitter, and Facebook has affected the total number of photos uploaded to Flickr.
To improve the prediction, the information from these sources would need to be added and adjusted. There will continue to be biases based on the demographics of each user base, and how the services are used.
It would be interesting to find a method of applying K-Medians to the area, to find the denser locations.
The model used above to predict the number of photos is a linear regression model from statsmodels. As a check, I also tried support vector regression and linear support vector regression models. They were less stable given the limited data and produced less accurate forecasts.
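To make the five-years-predict-the-sixth idea concrete, here is a rough sketch of a rolling linear forecast. The analysis itself used statsmodels; this example swaps in a hand-written ordinary least squares fit so that it runs without extra dependencies. The yearly counts are made up, and the helper names `fit_line`, `rolling_forecast`, and `rmse` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rolling_forecast(years, counts, window=5):
    """Fit each run of `window` years and predict the year that follows."""
    forecasts = {}
    for i in range(len(years) - window):
        slope, intercept = fit_line(years[i:i + window], counts[i:i + window])
        target = years[i + window]
        forecasts[target] = slope * target + intercept
    return forecasts

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical photo counts for one region, growing linearly by year.
years = list(range(2000, 2008))
counts = [100 + 10 * (y - 2000) for y in years]

forecasts = rolling_forecast(years, counts)
actual = dict(zip(years, counts))
error = rmse([actual[y] for y in forecasts], list(forecasts.values()))
```

Because the sample counts are exactly linear, the forecasts match the held-out years almost perfectly; real regional counts would leave a visible error, which is what the R-squared and root mean square error figures summarize.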
To gain granularity, the pictures could be binned by month. This would add noise, but also more information with which to refine the forecast.
Until then, enjoy!







