In this study, I examine the trend in the number of creators on the social media platform Flickr, as well as the trend in those creators’ sharing of nature photos. The sharing pattern of nature photos may indicate the presence of certain weather events.
Run the following code to import the packages needed for this project.
library(dplyr) # for working with dataframes
library(lubridate) # for working with time data
library(forecast) # for predicting time series data
library(ggplot2) # for plotting
library(gganimate) # for plotting animated figures
The social media data used here come from Flickr: photos taken in Michigan over the past twenty years.
First, read the data.
MichFlickr <- read.csv('../data/lab3/MichiganFlickr.csv') # read the csv file
head(MichFlickr) # show the first few lines of the data
## X fid cat field_1_1 id
## 1 1 1 58 58 101778
## 2 2 2 59 59 101779
## 3 3 3 60 60 101780
## 4 4 4 61 61 101781
## 5 5 5 62 62 101782
## 6 6 6 63 63 101783
## url dateupload latitude
## 1 https://farm1.staticflickr.com/1/101778_37e4536da6_m.jpg 1090510921 45.08638
## 2 https://farm1.staticflickr.com/1/101779_562306283b_m.jpg 1090510923 45.08633
## 3 https://farm1.staticflickr.com/1/101780_a04f7cfa78_m.jpg 1090510926 45.08633
## 4 https://farm1.staticflickr.com/1/101781_d0fc02c3bf_m.jpg 1090510930 45.08633
## 5 https://farm1.staticflickr.com/1/101782_7db5046b1f_m.jpg 1090510932 45.08633
## 6 https://farm1.staticflickr.com/1/101783_e4ec3882ff_m.jpg 1090510936 45.08633
## longitude owner title
## 1 -84.17036 44124269523@N01 Roaring Fire
## 2 -84.17069 44124269523@N01 Geese Reflected
## 3 -84.17069 44124269523@N01 The Night Sky
## 4 -84.17069 44124269523@N01 Sunset #1
## 5 -84.17069 44124269523@N01 Sunset #2
## 6 -84.17069 44124269523@N01 Sunset #3
## url_sq predict_Na Landuse
## 1 https://farm1.staticflickr.com/1/101778_37e4536da6_s.jpg 0.3283439 41
## 2 https://farm1.staticflickr.com/1/101779_562306283b_s.jpg 0.9205280 41
## 3 https://farm1.staticflickr.com/1/101780_a04f7cfa78_s.jpg 0.9658680 41
## 4 https://farm1.staticflickr.com/1/101781_d0fc02c3bf_s.jpg 0.9744647 41
## 5 https://farm1.staticflickr.com/1/101782_7db5046b1f_s.jpg 0.9365814 41
## 6 https://farm1.staticflickr.com/1/101783_e4ec3882ff_s.jpg 0.9627538 41
As we can see, the dateupload attribute is a large integer: it represents the upload time in seconds since 1970-01-01 (the Unix epoch). We need to convert it to a legible format. The base-R function as.POSIXct() turns the seconds into a date-time value, as.Date() extracts the calendar date, and the lubridate functions year(), month(), and day() pull out the individual components.
MichFlickr$time <- as.POSIXct(MichFlickr$dateupload, origin='1970-01-01') # change seconds into a legible time data
MichFlickr$date <- as.Date(format(MichFlickr$time, format='%Y-%m-%d')) # set the date format
# separately create year, month, and day attribute
MichFlickr$year <- year(MichFlickr$time)
MichFlickr$month <- month(MichFlickr$time, label = TRUE)
MichFlickr$day <- day(MichFlickr$time)
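As a quick sanity check of the conversion, we can feed in the first dateupload value from the table above (note that tz = 'UTC' is pinned here for reproducibility; without it, R uses the local time zone, which can shift the date):

```r
# convert the first epoch value shown above, pinned to UTC
t <- as.POSIXct(1090510921, origin = '1970-01-01', tz = 'UTC')
format(t, '%Y-%m-%d') # "2004-07-22", a few months after Flickr's launch
```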
The predict_Na attribute gives the estimated probability that a picture is a nature photo. Let’s say that if predict_Na \(> 0.6\), the picture is likely a nature photo.
# set a nature threshold and flag whether each record is a nature photo
MichFlickr$Nature <- MichFlickr$predict_Na > 0.6
Also, for future spatial analyses, I need Michigan’s border data. This can be downloaded through the map_data() function in the ggplot2 package.
states <- map_data("state") # get the state data
mich <- subset(states, region == "michigan") # subset the states to get only Michigan
counties <- map_data("county") # get the counties data
mich_county <- subset(counties, region == "michigan") # subset the counties to get only those in Michigan
# get the map
pmich <- ggplot(data=mich,
mapping=aes(x=long, y=lat, group=group)) + # set xy and group the peninsulas to prevent weird lines
coord_fixed(1.3) + # fix the aspect ratio: one unit of latitude is drawn 1.3 times as long as one unit of longitude
geom_polygon(color='black', fill='skyblue') + # set the border and fill color
geom_polygon(data = mich_county, fill = NA, color = "white") + # set the counties border color
geom_polygon(color = "black", fill = NA) + # get the state border back on top
labs(title='Map of Michigan', x = "Longitude", y = "Latitude")
pmich +
theme(plot.title = element_text(size = 20, color = "black", face='bold', hjust=0.5)) # set title format
Social media data include various sources of noise, related to the frequency of photo sharing, that can obscure meaningful patterns. Here I develop a noise-reduction procedure that goes beyond basic filtering. It consists of three steps: removing posts dated before Flickr’s launch, removing suspected robot accounts, and collapsing repeated posts by the same user within a given time period.
According to the Wikipedia page of
Flickr, the platform was launched on February 10, 2004, so any posts dated before then must be noise and need to be eliminated. I therefore set a time threshold. The as.Date() function turns a character string into a date variable.
denoise_mich <- MichFlickr %>%
filter(time >= as.Date('2004-2-10')) # mask out posts earlier than 2004-2-10
Social media analysis also needs to address fake or robot accounts, which can add noise and bias to the results. Here I detect robot accounts from their posting behavior: whether an account posts content at a fixed interval.
nrow_before_derobo <- nrow(denoise_mich) # record the rows before removing the robot accounts
user_interval <- denoise_mich %>% # get interval information
group_by(owner) %>% # group records from the same owner
filter(n()>=20) %>% # find robots from users who have posted >=20 times
arrange(owner, time) %>% # sort the records by owner and time
mutate(interval=c(0,diff(time))) # compute time interval (sec) in each group (give 0 to the first post)
interval_stats <- user_interval %>% # get interval statistics
summarise(publish_count=n(), # get publish count
interval_pattern = as.numeric(names(which.max(table(interval)))), # most frequent interval duration
interval_freq=as.numeric(table(interval)[which.max(table(interval))])) # get same interval frequency
head(interval_stats)
## # A tibble: 6 × 4
## owner publish_count interval_pattern interval_freq
## <chr> <int> <dbl> <dbl>
## 1 100000053@N05 28 1 19
## 2 100022158@N04 44 1 26
## 3 100117149@N06 84 19 12
## 4 100131648@N04 22 10 2
## 5 10023112@N00 290 9 30
## 6 100250160@N08 20 9 6
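The which.max(table(...)) idiom used for interval_pattern above can be verified on a toy vector of hypothetical gaps:

```r
# hypothetical intervals (seconds) between one account's consecutive posts
intervals <- c(0, 9, 9, 9, 3)
tab <- table(intervals)
as.numeric(names(which.max(tab))) # most frequent interval: 9
as.numeric(tab[which.max(tab)])   # how often it occurs: 3
```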
Human accounts should not show a strong posting pattern: if most of the intervals between posts are identical, the account is more likely to be a robot. Here I say that an account is a robot if more than 60% of its posts share the same interval.
robot <- interval_stats %>% # get robot blacklist
filter(interval_freq/publish_count>0.6 & interval_pattern>1) %>% # 60% same interval and the interval > 1 sec
arrange(desc(publish_count)) # sort the records by publish count
head(robot)
## # A tibble: 6 × 4
## owner publish_count interval_pattern interval_freq
## <chr> <int> <dbl> <dbl>
## 1 59637929@N03 110 2 82
## 2 96739540@N00 90 2 62
## 3 67773453@N05 88 2 54
## 4 30389278@N00 50 2 33
## 5 77598288@N00 42 3 38
## 6 10856322@N06 41 8 25
# now do anti join to delete the robot accounts in the original data
denoise_mich <- anti_join(denoise_mich, robot, by='owner') # remove robot accounts
nrow_before_derobo-nrow(denoise_mich) # compute how many are removed
## [1] 494
In total, 494 records that are likely noise were deleted.
When analyzing the creator trend on the platform, we want to know how many people are posting content on Flickr, so one user with multiple posts within a given time period should be counted only once.
# function of repetition detection
rm_rep_post <- function(df,time_range){ # time_range --> 'd' for day, 'm' for month, 'y' for year
if (time_range == 'd') { # different time ranges
df <- df %>%
group_by(owner, date) %>%
slice(1) %>% # leave only the first record in each group
ungroup()
} else if (time_range == 'm') {
df <- df %>%
group_by(owner, year, month) %>%
slice(1) %>%
ungroup()
} else if (time_range == 'y') {
df <- df %>%
group_by(owner, year) %>%
slice(1) %>%
ungroup()
} else {
warning("Invalid time range specified.")
}
return(df)
}
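As a minimal check of the daily branch, the same group_by() + slice(1) logic as rm_rep_post(df, 'd') can be applied to a hypothetical two-owner data frame:

```r
library(dplyr)

# toy data: owner "a" posts twice on the same day, owner "b" once
toy <- data.frame(
  owner = c('a@N00', 'a@N00', 'b@N00'),
  date  = as.Date(c('2012-03-01', '2012-03-01', '2012-03-01'))
)

daily_unique <- toy %>%
  group_by(owner, date) %>%
  slice(1) %>%   # keep only the first post per owner per day
  ungroup()

nrow(daily_unique) # 2: the duplicate same-day post is dropped
```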
Analyzing the evolution of the number of creators on a social media platform is important because it identifies periods of growth and decline, providing insight into the platform’s overall health. First, analyze the yearly creator trend on Flickr.
denoise_mich$count <- 1 # create a count attribute for adding up
# yearly creator trend analysis
yearly_creator <- denoise_mich %>%
rm_rep_post('y') %>% # remove the repeat posts from one creator in a year
group_by(year) %>% # group the records by year
summarise(total_creator=sum(count)) %>% # get count sum
na.omit() # omit the NAs
head(yearly_creator)
## # A tibble: 6 × 2
## year total_creator
## <dbl> <dbl>
## 1 2004 36
## 2 2005 298
## 3 2006 918
## 4 2007 1965
## 5 2008 2422
## 6 2009 2854
#plot the yearly trend in creators
ggplot(yearly_creator, aes(x = year, y = total_creator)) + # set x and y axes
geom_line(color = "red", linewidth = 1) + # plot line
scale_x_continuous(breaks = yearly_creator$year) + # show every year on x axis
theme(axis.text.x = element_text(angle=0)) + # set x axis text angle
labs(title='Yearly Trend in Creators on Flickr in Michigan', x = "Year", y = "Number of Creators")
As evident from the yearly trend, the number of Flickr creators in Michigan increased from 2004 to 2011, reaching its peak at around 4000 individuals in the years 2011 to 2013, and began to decline in the subsequent years. By the year 2018, there were only approximately 400 Flickr creators in Michigan.
We can also analyze the monthly creator trend within each year.
# analyze monthly trend
monthly_creator <- denoise_mich %>%
rm_rep_post('m') %>%
group_by(year,month) %>%
summarise(total_creator=sum(count)) %>%
na.omit()
head(monthly_creator)
## # A tibble: 6 × 3
## # Groups: year [1]
## year month total_creator
## <dbl> <ord> <dbl>
## 1 2004 Mar 2
## 2 2004 Apr 1
## 3 2004 May 1
## 4 2004 Jun 2
## 5 2004 Jul 3
## 6 2004 Aug 2
# plot
monthly_creator %>%
ggplot(aes(x = month, y = total_creator, group = year)) +
geom_line(aes(color = as.factor(year))) +
labs(title = "Monthly Trend Every Year in Creators on Flickr in Michigan",
x = "Month", y = "Number of Creators") +
scale_color_discrete(name = "Year") + # set legend title
theme_classic()
As shown in the graph, the number of creators is highest during the summer months.
I also want to predict the future trend in creators on Flickr in Michigan. Here I compute the daily creator trend and then forecast it with an Autoregressive Integrated Moving Average (ARIMA) model.
# analyze daily trend
daily_creator <- denoise_mich %>%
rm_rep_post('d') %>%
group_by(date) %>%
summarise(total_creator=sum(count)) %>%
na.omit()
# plot daily trend
ggplot(daily_creator, aes(x = date, y = total_creator)) + # set x and y axes
geom_line(color = "red", linewidth = 1) + # plot line
theme(axis.text.x = element_text(angle=0)) + # set x axis text angle
geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE, color='yellow') +
labs(title='Daily Trend in Creators on Flickr in Michigan', x = "Time", y = "Number of Creators")
# predict future trend
daily_creator <- arrange(daily_creator,date)
creator_ts <- ts(daily_creator$total_creator,start=2004,end=2017,frequency=365) # convert to a daily time series (assumes roughly one observation per day)
arima_model <- auto.arima(creator_ts) # get model
plot(arima_model$x,col="red") # see the fit of the model: plot real data
lines(fitted(arima_model),col="blue") # see the fit of the model: plot predicted
fore_arima = forecast::forecast(arima_model, h=(3*365)) # predict future trend
plot(fore_arima,ylim=c(0,125),xlab='Year',ylab='Number of Creators')
As shown in the graphs, the number of creators is not expected to exceed its historical peak within the next three years.
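Before relying on the forecast, it is worth checking how well the selected model fits. A minimal sketch on synthetic data (the real series would be creator_ts above; the AR(1) simulation is only an illustrative stand-in):

```r
library(forecast)

set.seed(1)
# synthetic stand-in series: AR(1) noise around a level of 50
sim <- ts(50 + arima.sim(model = list(ar = 0.7), n = 300))

fit <- auto.arima(sim)  # automatic ARIMA order selection
accuracy(fit)           # training-set errors (RMSE, MAE, MAPE, ...)
checkresiduals(fit)     # Ljung-Box test and residual diagnostics
```

If the residuals show remaining autocorrelation, the model is underfitting and the forecast intervals should be treated with caution.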
There was a tornado in Michigan in 2012, affecting the southeast of the state on and before March 15, 2012. I want to find out how this weather event affected users’ sharing of nature photos.
animateMich <- denoise_mich %>%
filter(date >= as.Date('2012-03-01') & date <= as.Date('2012-03-31') & Nature == TRUE) # set time limits and keep only nature photos
pani <-pmich +
geom_polygon(data = mich_county, fill = NA, color = "white") +
geom_polygon(color = "black", fill = NA) +
geom_point(data = animateMich, aes(longitude, latitude), inherit.aes = FALSE) +
labs(title = 'Date: {format(frame_time, "%b %d %Y")}') +
transition_time(date)
animate(pani + shadow_wake(0.1), fps=5) # get gif result
As shown in the GIF, there is a decline in the number of nature photos in southeast Michigan during the days before and on March 15, 2012, compared with the following days and the beginning of March. This can be explained by the presence of the tornado, highlighting the potential of social media data for analyzing the spatial and temporal track of severe weather events.
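To complement the animation with numbers, one could count daily nature photos inside a rough southeast-Michigan bounding box. The coordinates below are an illustrative assumption, and the toy points stand in for rows of denoise_mich:

```r
library(dplyr)

# rough southeast-Michigan bounding box (illustrative assumption)
se_box <- list(lat_min = 41.7, lat_max = 43.0, lon_min = -84.5, lon_max = -82.4)

# toy points standing in for denoise_mich: one Detroit-area photo, one northern
pts <- data.frame(
  latitude  = c(42.33, 44.80),
  longitude = c(-83.05, -84.77),
  date      = as.Date(c('2012-03-14', '2012-03-14')),
  Nature    = c(TRUE, TRUE)
)

se_daily <- pts %>%
  filter(latitude >= se_box$lat_min, latitude <= se_box$lat_max,
         longitude >= se_box$lon_min, longitude <= se_box$lon_max,
         Nature) %>%          # keep only nature photos inside the box
  count(date, name = 'n_photos')

se_daily # one row: 2012-03-14 with n_photos = 1
```

A dip in this daily count around March 15 would corroborate the pattern visible in the GIF.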