Introduction

In this study, I examine the trend in the number of creators on the social media platform Flickr, as well as the trend in those creators’ sharing of nature photos. The sharing pattern of nature photos may indicate the presence of certain weather events.


Needed Packages

Run the following code to load the packages needed for this project.

library(dplyr)  # for working with dataframes
library(lubridate)  # for working with time data
library(forecast)  # for predicting time series data
library(ggplot2)  # for plotting
library(gganimate)  # for plotting animated figures

Data Preparation

The social media data used here come from Flickr: photos taken in Michigan over the past twenty years.

First read the data.

MichFlickr <- read.csv('../data/lab3/MichiganFlickr.csv')  # read the csv file
head(MichFlickr)  # show the first few lines of the data
##   X fid cat field_1_1     id
## 1 1   1  58        58 101778
## 2 2   2  59        59 101779
## 3 3   3  60        60 101780
## 4 4   4  61        61 101781
## 5 5   5  62        62 101782
## 6 6   6  63        63 101783
##                                                        url dateupload latitude
## 1 https://farm1.staticflickr.com/1/101778_37e4536da6_m.jpg 1090510921 45.08638
## 2 https://farm1.staticflickr.com/1/101779_562306283b_m.jpg 1090510923 45.08633
## 3 https://farm1.staticflickr.com/1/101780_a04f7cfa78_m.jpg 1090510926 45.08633
## 4 https://farm1.staticflickr.com/1/101781_d0fc02c3bf_m.jpg 1090510930 45.08633
## 5 https://farm1.staticflickr.com/1/101782_7db5046b1f_m.jpg 1090510932 45.08633
## 6 https://farm1.staticflickr.com/1/101783_e4ec3882ff_m.jpg 1090510936 45.08633
##   longitude           owner           title
## 1 -84.17036 44124269523@N01    Roaring Fire
## 2 -84.17069 44124269523@N01 Geese Reflected
## 3 -84.17069 44124269523@N01   The Night Sky
## 4 -84.17069 44124269523@N01       Sunset #1
## 5 -84.17069 44124269523@N01       Sunset #2
## 6 -84.17069 44124269523@N01       Sunset #3
##                                                     url_sq predict_Na Landuse
## 1 https://farm1.staticflickr.com/1/101778_37e4536da6_s.jpg  0.3283439      41
## 2 https://farm1.staticflickr.com/1/101779_562306283b_s.jpg  0.9205280      41
## 3 https://farm1.staticflickr.com/1/101780_a04f7cfa78_s.jpg  0.9658680      41
## 4 https://farm1.staticflickr.com/1/101781_d0fc02c3bf_s.jpg  0.9744647      41
## 5 https://farm1.staticflickr.com/1/101782_7db5046b1f_s.jpg  0.9365814      41
## 6 https://farm1.staticflickr.com/1/101783_e4ec3882ff_s.jpg  0.9627538      41

As we can see, the dateupload attribute is a huge integer. It is a Unix timestamp: the number of seconds elapsed since 1970-01-01. We need to convert it to a legible format. The as.POSIXct() function turns the seconds into a date-time, as.Date() extracts the calendar date, and the year(), month(), and day() functions from the lubridate package pull out the individual components.

MichFlickr$time <- as.POSIXct(MichFlickr$dateupload, origin='1970-01-01')  # change seconds into a legible time data
MichFlickr$date <- as.Date(format(MichFlickr$time, format='%Y-%m-%d'))  # set the date format
# separately create year, month, and day attribute
MichFlickr$year <- year(MichFlickr$time)
MichFlickr$month <- month(MichFlickr$time, label = TRUE)
MichFlickr$day <- day(MichFlickr$time)

The predict_Na attribute gives an index of how likely the picture is to be a photo of nature. Let’s say that if predict_Na > 0.6, the picture is likely to be a nature photo.

# set a nature threshold and create an attribute indicating whether this is a nature photo
MichFlickr$Nature<- MichFlickr$predict_Na > 0.6
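
As a quick, optional sanity check (not required for the rest of the analysis), we can see how many photos pass this threshold:

table(MichFlickr$Nature)  # count of nature vs. non-nature photos
mean(MichFlickr$Nature, na.rm = TRUE)  # share of photos classified as nature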

Also, for future spatial analyses, I need Michigan’s border data. This can be obtained through the map_data() function in the ggplot2 package (which requires the maps package to be installed).

states <- map_data("state")  # get the state data
mich <- subset(states, region == "michigan")  # subset the states to get only Michigan
counties <- map_data("county")  # get the counties data
mich_county <- subset(counties, region == "michigan")  # subset the counties to get only those in Michigan
# get the map
pmich <- ggplot(data=mich,
                mapping=aes(x=long, y=lat, group=group)) +  # set xy and group the peninsulas to prevent weird lines
  coord_fixed(1.3) +  # fix the aspect ratio: one unit of latitude is drawn 1.3 times as long as one unit of longitude
  geom_polygon(color='black', fill='skyblue') +  # set the border and fill color
  geom_polygon(data = mich_county, fill = NA, color = "white") +  # set the counties border color
  geom_polygon(color = "black", fill = NA) +  # get the state border back on top
  labs(title='Map of Michigan', x = "Longitude", y = "Latitude")
pmich +
  theme(plot.title = element_text(size = 20, color = "black", face='bold', hjust=0.5))  # set title format


Noise Reduction

Social media data include various sources of noise related to the frequency of sharing photographs, and this noise can obscure meaningful patterns. Here I develop a noise-reduction technique that goes beyond basic filtering. The technique includes three steps:

  1. Time Range Check
  2. Robot Detection
  3. (Optional) Remove duplicate posts from one user

1 Time Range Check

According to the Wikipedia page for Flickr, the platform was launched on February 10, 2004, so posts dated before that time must be noise and need to be eliminated. Therefore, I will set a time threshold. The as.Date() function can be used to turn a character string into a Date variable.

denoise_mich <- MichFlickr %>%
  filter(time >= as.Date('2004-2-10'))  # mask out posts earlier than 2004-2-10
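
As an optional check (not part of the original workflow), we can confirm that no pre-launch posts remain and see how many records the time filter removed:

min(denoise_mich$date)  # earliest remaining upload date, should be on or after 2004-02-10
nrow(MichFlickr) - nrow(denoise_mich)  # number of records dropped by the time filter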

2 Robot Detection

The analysis of social media needs to address the issue of fake or robot accounts, which can add noise and bias to the results. Here I detect robot accounts based on posting behavior: whether an account posts content at a fixed interval.

nrow_before_derobo <- nrow(denoise_mich)  # record the rows before removing the robot accounts
user_interval <- denoise_mich %>%  # get interval information
  group_by(owner) %>%  # group records from the same owner
  filter(n()>=20) %>%  # find robots from users who have posted >=20 times
  arrange(owner, time) %>%  # sort the records by owner and time
  mutate(interval=c(0,diff(time)))  # compute time interval (sec) in each group (give 0 to the first post)
interval_stats <- user_interval %>%  # get interval statistics
  summarise(publish_count=n(),  # get publish count
            interval_pattern = as.numeric(names(which.max(table(interval)))),  # most frequent interval duration
            interval_freq=as.numeric(table(interval)[which.max(table(interval))]))  # get same interval frequency
head(interval_stats)
## # A tibble: 6 × 4
##   owner         publish_count interval_pattern interval_freq
##   <chr>                 <int>            <dbl>         <dbl>
## 1 100000053@N05            28                1            19
## 2 100022158@N04            44                1            26
## 3 100117149@N06            84               19            12
## 4 100131648@N04            22               10             2
## 5 10023112@N00            290                9            30
## 6 100250160@N08            20                9             6
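
The interval_pattern expression above extracts each owner’s most frequent posting interval. To make the table()/which.max() idiom clearer, here is a standalone toy example (illustration only, not part of the Flickr data):

x <- c(2, 2, 5, 2, 9, 5, 2)  # a toy vector of intervals
table(x)  # counts for each unique interval
as.numeric(names(which.max(table(x))))  # the most frequent interval, here 2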

Human accounts should not show a strong pattern in their posting times. If most of the intervals between posts are identical, the account is more likely to be a robot. Here I treat an account as a robot if more than 60% of its posts share the same interval; in the code below I additionally require that repeated interval to be longer than one second.

robot <- interval_stats %>%  # get robot blacklist
  filter(interval_freq/publish_count>0.6 & interval_pattern>1) %>%  # 60% same interval and the interval > 1 sec
  arrange(desc(publish_count))  # sort the records by publish count
head(robot)
## # A tibble: 6 × 4
##   owner        publish_count interval_pattern interval_freq
##   <chr>                <int>            <dbl>         <dbl>
## 1 59637929@N03           110                2            82
## 2 96739540@N00            90                2            62
## 3 67773453@N05            88                2            54
## 4 30389278@N00            50                2            33
## 5 77598288@N00            42                3            38
## 6 10856322@N06            41                8            25
# now do anti join to delete the robot accounts in the original data
denoise_mich <- anti_join(denoise_mich, robot, by='owner')  # remove robot accounts
nrow_before_derobo-nrow(denoise_mich)  # compute how many are removed
## [1] 494

494 records, which are likely to be noise, are deleted.
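
As an optional spot-check of the rule (using the interval table computed above), we can look at the posting intervals of one flagged account, for example the top entry 59637929@N03:

user_interval %>%
  filter(owner == '59637929@N03') %>%  # one of the accounts flagged as a robot
  count(interval, sort = TRUE) %>%  # how often each interval length occurs
  head()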

3 (Optional) Remove duplicate posts from one user

When analyzing the creator trend on the social media platform, we want to know how many distinct people are posting content on Flickr. So a single user posting multiple times within a chosen time window should be counted only once. The function below removes such duplicates.

# function of repetition detection
rm_rep_post <- function(df, time_range){  # time_range --> 'd' for day, 'm' for month, 'y' for year
  if (time_range == 'd') {  # different time ranges
    df <- df %>%
      group_by(owner, date) %>%
      slice(1) %>%  # leave only the first record in each group
      ungroup()
  } else if (time_range == 'm') {
    df <- df %>%
      group_by(owner, year, month) %>%
      slice(1) %>%
      ungroup()
  } else if (time_range == 'y') {
    df <- df %>%
      group_by(owner, year) %>%
      slice(1) %>%
      ungroup()
  } else {
    warning("Invalid time range specified.")
  }
  return(df)
}
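
As a usage illustration (the same calls appear in the analysis below), keeping at most one post per creator per day looks like this; the object name daily_unique is just an example:

daily_unique <- rm_rep_post(denoise_mich, 'd')  # keep one record per owner per day
nrow(daily_unique)  # fewer rows than denoise_mich

Using slice(1) keeps the first record in each group; dplyr::distinct(owner, date, .keep_all = TRUE) would be an equivalent, more compact way to express the daily case.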

Data Analysis

1 Flickr Creator Trend Analysis

Analyzing how the number of creators on a social media platform evolves is important because it identifies periods of growth and decline, providing insight into the platform’s overall health. First, I analyze the yearly creator trend on Flickr.

denoise_mich$count <- 1 # create a count attribute for adding up
# yearly creator trend analysis
yearly_creator <- denoise_mich %>%
  rm_rep_post('y') %>%  # remove the repeat posts from one creator in a year
  group_by(year) %>%  # group the records by year
  summarise(total_creator=sum(count)) %>%  # get count sum
  na.omit()  # omit the NAs
head(yearly_creator)
## # A tibble: 6 × 2
##    year total_creator
##   <dbl>         <dbl>
## 1  2004            36
## 2  2005           298
## 3  2006           918
## 4  2007          1965
## 5  2008          2422
## 6  2009          2854
#plot the yearly trend in creators
ggplot(yearly_creator, aes(x = year, y = total_creator)) +  # set x and y axes
  geom_line(color = "red", linewidth = 1) +  # plot line
  scale_x_continuous(breaks = yearly_creator$year) +  # show every year on x axis
  theme(axis.text.x = element_text(angle=0)) +  # set x axis text angle 
  labs(title='Yearly Trend in Creators on Flickr in Michigan', x = "Year", y = "Number of Creators")

As evident from the yearly trend, the number of Flickr creators in Michigan increased from 2004 to 2011, reaching its peak at around 4000 individuals in the years 2011 to 2013, and began to decline in the subsequent years. By the year 2018, there were only approximately 400 Flickr creators in Michigan.

We can also analyze the monthly creator trend within each year.

# analyze monthly trend
monthly_creator <- denoise_mich %>%
  rm_rep_post('m') %>%
  group_by(year,month) %>%
  summarise(total_creator=sum(count)) %>%
  na.omit()
head(monthly_creator)
## # A tibble: 6 × 3
## # Groups:   year [1]
##    year month total_creator
##   <dbl> <ord>         <dbl>
## 1  2004 Mar               2
## 2  2004 Apr               1
## 3  2004 May               1
## 4  2004 Jun               2
## 5  2004 Jul               3
## 6  2004 Aug               2
# plot
monthly_creator  %>%
  ggplot(aes(x = month, y = total_creator, group = year)) +
  geom_line(aes(color = as.factor(year))) +
  labs(title = "Monthly Trend Every Year in Creators on Flickr in Michigan",
       x = "Month", y = "Number of Creators") +
  scale_color_discrete(name = "Year") +  # set legend title
  theme_classic()

As shown in the graph, the number of creators is highest during the summer months.

2 Flickr Creator Trend Prediction

I also want to predict the future trend in creators on Flickr in Michigan. Here I compute the daily creator counts and then fit an Autoregressive Integrated Moving Average (ARIMA) model to forecast them.

# analyze daily trend
daily_creator <- denoise_mich %>%
  rm_rep_post('d') %>%
  group_by(date) %>%
  summarise(total_creator=sum(count)) %>%
  na.omit()
# plot daily trend
ggplot(daily_creator, aes(x = date, y = total_creator)) +  # set x and y axes
  geom_line(color = "red", linewidth = 1) +  # plot line
  theme(axis.text.x = element_text(angle=0)) +  # set x axis text angle 
  geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE, color='yellow') +
  labs(title='Daily Trend in Creators on Flickr in Michigan', x = "Time", y = "Number of Creators")

# predict future trend
daily_creator <- arrange(daily_creator,date)
creator_ts <- ts(daily_creator$total_creator,start=2004,end=2017,frequency=365)  # convert to time series
arima_model <- auto.arima(creator_ts)  # get model
plot(arima_model$x,col="red")  # see the fit of the model: plot real data
lines(fitted(arima_model),col="blue")  # see the fit of the model: plot predicted
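
Before forecasting, it can be helpful to inspect the model that auto.arima() selected and its residuals (an optional diagnostic step, not part of the original analysis):

arima_model  # print the selected ARIMA order and coefficients
forecast::accuracy(arima_model)  # in-sample error measures such as RMSE and MAE
checkresiduals(arima_model)  # residual plots and Ljung-Box test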

fore_arima = forecast::forecast(arima_model, h=(3*365))  # predict future trend
plot(fore_arima,ylim=c(0,125),xlab='Year',ylab='Number of Creators')

As shown in the graphs, the forecast suggests that the number of creators will not exceed the earlier peak over the following three years.

3 Spatial Temporal Analysis of Nature Photos

There was a tornado in Michigan in 2012 that affected the southeast of the state on and before March 15, 2012. I want to examine the impact of this weather event on users’ sharing of nature photos.

animateMich <- denoise_mich %>%
  filter(date >= as.Date('2012-03-01') & date <= as.Date('2012-3-31') & Nature==T)  # set time limit and leave only nature photos
pani <-pmich +
        geom_polygon(data = mich_county, fill = NA, color = "white") +
        geom_polygon(color = "black", fill = NA) + 
        geom_point(data = animateMich, aes(longitude, latitude), inherit.aes = FALSE) +
        labs(title = 'Date: {format(frame_time, "%b %d %Y")}') +
        transition_time(date)
animate(pani + shadow_wake(0.1), fps=5)  # get gif result
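
To keep the result, gganimate’s anim_save() writes the most recently rendered animation to disk (the file name here is just an example):

anim_save('mich_nature_2012.gif')  # save the last rendered animation as a GIF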

As shown in the GIF, there is a decline in the number of nature photos in southeast Michigan during the days before and on March 15, 2012, compared with the following days and the beginning of March. This can be explained by the presence of the tornado, highlighting the potential use of social media data to analyze the spatial and temporal track of severe weather events.