Using the 100 Million Photos and Videos database from Flickr, how can we predict travel patterns in the United States and Central America?

mm-wang/flickrtravel

Predicting Travel Patterns Using Flickr

Using the 100 Million Photos and Videos database from Flickr to predict travel patterns within the United States and Central America.

The Hypothesis:

People often travel to take photographs, or at least seek out a specific place to take them. Even if that place is their own backyard, it has meaning and visual appeal. I am interested in photography as a predictor of desirable travel destinations. Where do people like to take photographs now? Where will they like to take them in the future?

Where will people travel?

The Preprocessing/Cleaning/Manipulation

Each record in the Flickr database contains the following fields:

  • Photo/video ID
  • User NSID, User nickname
  • Date taken
  • Date uploaded
  • Capture device
  • Title, Description
  • User tags (comma-separated), Machine tags (comma-separated)
  • Longitude, Latitude
  • Accuracy
  • Photo/video page URL, Photo/video download URL
  • License name, License URL
  • Photo/video server identifier, Photo/video farm identifier
  • Photo/video secret, Photo/video secret original
  • Photo/video extension original
  • Photo/video marker (0 = photo, 1 = video)

Cleaning consisted of the following steps:

  • Removing any capture device with "scan" in its name
  • Binning the remaining devices by camera brand, and placing any brand that accounts for less than 1% of records into an "Other" category
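
The two cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual code: the function name `clean_devices`, the sample device strings, and the rule that the brand is the first whitespace-separated word of the device name are all assumptions made for the example.

```python
from collections import Counter

def clean_devices(devices, min_share=0.01):
    """Drop scanner entries, then bin rare camera brands into "Other"."""
    # Step 1: remove any capture device with "scan" in its name.
    kept = [d for d in devices if "scan" not in d.lower()]
    # Assumption: the brand is the first word of the device string.
    brands = [d.split()[0] for d in kept if d.strip()]
    counts = Counter(brands)
    total = len(brands)
    # Step 2: brands below the share threshold are relabeled "Other".
    return [b if counts[b] / total >= min_share else "Other" for b in brands]

# Hypothetical sample: two common brands, one rare brand, a few scanners.
devices = (
    ["Canon EOS 5D"] * 120
    + ["Apple iPhone"] * 79
    + ["Holga 120"] * 1
    + ["CanoScan LiDE"] * 5
)
binned = Counter(clean_devices(devices))
```

In this sample the scanner entries are dropped before shares are computed, so the 1% threshold applies to the remaining 200 records only.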

Visual Explorations

Exploring the camera brands in the dataset shows that Canon's share grows over time, although the introduction of the Apple iPhone in 2007 quickly makes Apple a contender.

[Figures: camera brands in 2006 (cb2006) and 2007 (cb2007)]

Clustering Optimization and Analysis

This analysis focused on the United States and Central America, and K-Means clustering was used to break the area up into regions. To choose the number of clusters, a silhouette score was computed for each cluster count in a range of candidates. Using the scores as a guideline, the final number of clusters selected was 15.

[Figures: K-Means silhouette scores (kmsil) and the resulting clusters (kmclusters)]
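
As a rough sketch of the silhouette-based selection described above, the snippet below scores a range of cluster counts with scikit-learn. The helper name `silhouette_by_k`, the synthetic coordinates, and the candidate range are assumptions for illustration; the source does not say which library or range was actually used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(coords, k_values, seed=0):
    """Return {k: silhouette score} for K-Means fits on (lon, lat) points."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(coords)
        # Note: plain Euclidean distance on longitude/latitude ignores
        # the curvature of the Earth; acceptable only for a sketch.
        scores[k] = silhouette_score(coords, labels)
    return scores

# Hypothetical data: three synthetic blobs of (longitude, latitude) points.
rng = np.random.default_rng(0)
centers = np.array([[-122.0, 47.0], [-100.0, 20.0], [-74.0, 41.0]])
coords = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = silhouette_by_k(coords, range(2, 7))
best_k = max(scores, key=scores.get)
```

The highest score is only a guideline. As in the analysis above, the final cluster count can still be chosen by judgment, for example to get regions that are meaningful geographically.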

Linear Regression

The points were grouped by cluster, and the yearly photo counts for each cluster were used to create the set of time series below, sorted by region. Using a time slice of five years to predict each sixth year, the R-squared values averaged 86.2%, with a root mean square error of 11.9%.

| Far West | West | Central | East |
|---|---|---|---|
| Alaska | Pacific Northwest | Northern Mountains | Northeast |
| Western Canada | California | Rocky Mountains | Mid-Atlantic |
| Hawaii | Southwest | Great Lakes | Southeast |
| | Central America | South | Caribbean |
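
One possible reading of "five years to predict each sixth year" is a rolling fit: fit a straight line to each five-year slice of a region's series and extrapolate one year ahead. The sketch below follows that reading with plain-Python least squares standing in for the statsmodels model; the function names and the yearly share values are hypothetical, and the real features may have been set up differently.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rolling_forecast(years, values, window=5):
    """Fit each `window`-year slice and predict the following year.

    Returns a list of (target_year, predicted, actual) tuples.
    """
    results = []
    for start in range(len(years) - window):
        xs = years[start:start + window]
        ys = values[start:start + window]
        slope, intercept = fit_line(xs, ys)
        target = years[start + window]
        results.append((target, slope * target + intercept, values[start + window]))
    return results

def rmse(results):
    """Root mean square error between predicted and actual values."""
    return sqrt(sum((p - a) ** 2 for _, p, a in results) / len(results))

# Hypothetical series: one region's share of photos per year (perfectly linear).
years = list(range(2000, 2012))
share = [0.040 + 0.003 * i for i in range(len(years))]

forecasts = rolling_forecast(years, share)
error = rmse(forecasts)
```

Because the sample series is exactly linear, the forecasts match the actual values up to rounding. Real regional series are noisier, which is what the reported R-squared and RMSE summarize.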

What Will Happen in 2019?

Based on the analysis, the Pacific Northwest will be the most popular place, holding its status from 2000 onward. The least popular locations will be Hawaii and the South. There will be a growing trend in visits to Central America, and to California.

[Figures: Pacific Northwest, Central America, California, Hawaii]

Next Steps

This analysis is based on a simple K-Means clustering with a fine-tuned number of clusters. The data was sliced into a simple year-by-year time series and analyzed using linear regression.

More Diversified Data

Using a database of only Flickr photos introduces biases into the data and the prediction. For example, the relative popularity of Flickr peaked around 2010-2011 and has noticeably declined since. The rise of other photo-sharing services such as Instagram, Twitter, and Facebook has affected the total number of photos uploaded to Flickr.

To improve the prediction, the information from these sources would need to be added and adjusted. There will continue to be biases based on the demographics of each user base, and how the services are used.

More Models

It would be interesting to find a way to apply K-Medians to the area, to find the denser locations.

The model used above to predict the number of photos is a linear regression model from statsmodels. I also tried support vector regression and linear support vector regression models as a check. They were less stable given the limited data and produced less accurate forecasts.

Smaller Time Slices

To gain granularity, it would be prudent to group the pictures by month. This would add noise, but it would also add information that could help define the forecast more accurately.
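
A minimal sketch of monthly grouping, assuming the "Date taken" field can be parsed with the format string shown (the actual format in the dataset may differ, and `monthly_counts` is a hypothetical helper):

```python
from collections import Counter
from datetime import datetime

def monthly_counts(date_strings, fmt="%Y-%m-%d %H:%M:%S"):
    """Count records per (year, month), skipping unparseable dates."""
    counts = Counter()
    for text in date_strings:
        try:
            taken = datetime.strptime(text, fmt)
        except ValueError:
            continue  # malformed or missing date: leave it out
        counts[(taken.year, taken.month)] += 1
    return counts

# Hypothetical sample of "Date taken" values.
sample = [
    "2010-07-04 12:00:00",
    "2010-07-19 08:30:00",
    "2010-08-01 00:00:00",
    "not a date",
]
by_month = monthly_counts(sample)
```

The resulting counts can be sorted by key to form a monthly time series for each cluster.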

Until then, enjoy!
