Explore and reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.
In this project, we need to reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.
Processes include: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
Here four network data modelling and visualisation packages will be installed and launched. They are igraph, tidygraph, ggraph and visNetwork. Beside these four packages, tidyverse and lubridate, an R package specially designed to handle and wrangling time data will be installed and launched too.
The chunk code on the right will do the trick.
packages = c('igraph', 'tidygraph',
'ggraph', 'visNetwork',
'lubridate', 'clock',
'tidyverse', 'graphlayouts')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
There are two data sets needed. One contains the nodes data and the other contains the edges (also know as link) data.
In this step, we will import .csv into RStudio environment by using read_csv() of readr package.
edges <- read_csv("data/SocialNetwork.csv")
nodes <- read_csv("data/Participants.csv")
It is always a good practice to examine the imported data frame before further analysis is performed.
Let’s take an overview of the datasets, we will examine the structure of the data frame using glimpse() of dplyr.
glimpse(edges)
glimpse(nodes)
We will aggregate the participants by from, to, date, weekday, week and the working-day/weekends.
workdays <- c('Monday','Tuesday','Wednesday','Thursday','Friday')
edges <- edges %>%
mutate(from = participantIdFrom,
to = participantIdTo) %>%
mutate(weekday = wday(timestamp,
label = TRUE,
abbr = FALSE)) %>%
mutate(month = month(timestamp,
label = FALSE)) %>%
mutate(week = lubridate::week(timestamp)) %>%
mutate(when = case_when(weekday %in% workdays ~ "Working-day",
TRUE ~ "Weekends")) %>%
select(from,to,timestamp,weekday,week,when)
- Rename ‘HighSchoolOrCollege’
nodes$educationLevel <- sub('HighSchoolOrCollege',
'High School or College',
nodes$educationLevel)
- Rename columns
nodes <- nodes %>%
rename('ID' = 'participantId',
'HouseholdSize' = 'householdSize',
'HaveKids' = 'haveKids',
'Age' = 'age',
'EducationLevel' = 'educationLevel',
'InterestGroup' = 'interestGroup',
'Joviality' = 'joviality')
- Age Binning
Age variable is binned with the following code chunk:
- Write and Read rds Files
edges <- read_rds('data/rds/edges.rds')
nodes <- read_rds('data/rds/nodes.rds')
Firstly, let’s have a look of the general social network situation during the whole investigated period.
- The Edges Data
Four functions from dplyr package are used here. They are: filter(), group(), summarise(), and ungroup().
The output data.frame is called edges_aggregated. A new field called Weight has been added in edges_aggregated.
# A tibble: 160,842 x 3
from to Weight
<dbl> <dbl> <int>
1 0 27 14
2 0 33 16
3 0 37 20
4 0 53 15
5 0 72 16
6 0 123 17
7 0 226 16
8 0 300 18
9 0 316 34
10 0 341 17
# ... with 160,832 more rows
How is the distribution of the social network?
hist(edges_aggregated$Weight)
We can observe that during this period the highest frequency one socialize with others is higher than 400 and as the count of edges increases, the frequency decreases, which means most people does not like to socialize with others.
So, let’s select edges with weight higher than 200 - those participants is more actively socialized during this period.
edges_aggregated <- edges_aggregated %>%
filter(Weight >200)
Now, we have 3 columns and 4210 records of socialization.
glimpse(edges_aggregated)
Rows: 4,210
Columns: 3
$ from <dbl> 2, 2, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6,~
$ to <dbl> 69, 176, 29, 92, 113, 221, 882, 953, 82, 136, 203, 10~
$ Weight <int> 244, 279, 206, 355, 315, 397, 209, 383, 276, 357, 275~
- The Nodes Data
For the edges data, we need to find those participants who have been in the 4210 records we selected.
Same functions are used here to get the network fro both working days and weekends.
- The Edges Data
# A tibble: 317,900 x 4
from to when Weight
<dbl> <dbl> <chr> <int>
1 0 27 Weekends 4
2 0 27 Working-day 10
3 0 33 Weekends 4
4 0 33 Working-day 12
5 0 37 Weekends 5
6 0 37 Working-day 15
7 0 53 Weekends 5
8 0 53 Working-day 10
9 0 72 Weekends 4
10 0 72 Working-day 12
# ... with 317,890 more rows
Code chunk below gives the distribution of participants’ socialize situation in working-days and weekends.
hist(edges_byWork$Weight)
We can observe that the frequency of socialization of majority participants is less than 100. We then have a look at those with weights more than 100.
edges_byWork <- edges_byWork %>%
filter(Weight >100)
glimpse(edges_byWork)
Rows: 9,986
Columns: 4
$ from <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4,~
$ to <dbl> 555, 675, 731, 69, 136, 149, 176, 202, 218, 412, 660,~
$ when <chr> "Working-day", "Working-day", "Working-day", "Working~
$ Weight <int> 127, 113, 101, 176, 106, 102, 198, 125, 125, 103, 105~
- The Nodes Data
For the edges data, we need to find those participants who have been in the 9986 records we selected.
We will create a graph data model by using tidygraph package. It provides a tidy API for graph/network manipulation. While network data itself is not tidy, it can be envisioned as two tidy tables, one for node data and one for edge data. tidygraph provides a way to switch between the two tables and provides dplyr verbs for manipulating them. Furthermore it provides access to a lot of graph algorithms with return values that facilitate their use in a tidy workflow.
Two functions of tidygraph package can be used to create network objects, they are:
graph_aggregated <- graph_from_data_frame (edges_aggregated,
vertices = nodes_aggregated) %>%
as_tbl_graph()
graph_aggregated
# A tbl_graph: 763 nodes and 4210 edges
#
# A directed simple graph with 24 components
#
# Node Data: 763 x 8 (active)
name HouseholdSize HaveKids Age EducationLevel InterestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 2 3 TRUE 35 High School o~ A
2 4 3 TRUE 43 Bachelors H
3 5 3 TRUE 32 High School o~ D
4 6 3 TRUE 26 High School o~ I
5 7 3 TRUE 27 Bachelors A
6 8 3 TRUE 20 Bachelors G
# ... with 757 more rows, and 2 more variables: Joviality <dbl>,
# AgeGroup <chr>
#
# Edge Data: 4,210 x 3
from to Weight
<int> <int> <int>
1 1 56 244
2 1 146 279
3 2 24 206
# ... with 4,207 more rows
We will plot Network Data with ggraph package - an extension of ggplot2, making it easier to carry over basic ggplot skills to the design of network graphs.
As in all network graph, there are three main aspects to a ggraph’s network graph, they are:nodes,edges and layouts.
set_graph_style()
g <- ggraph(graph_aggregated,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.01, .5)) +
geom_node_point(aes(colour = EducationLevel),
size = 0.4) +
theme(legend.position = 'bottom')
g
g + facet_nodes(~EducationLevel) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
We can detect that there’re mainly 3 most significant network clusters.
It seems that participants with High School or College education level are more active in social activities.
set_graph_style()
g <- ggraph(graph_aggregated,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.01, .5)) +
geom_node_point(aes(colour = InterestGroup),
size = 0.4) +
theme(legend.position = 'bottom')
g
g + facet_nodes(~InterestGroup) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
We can detect that there are also mainly 3 most significant network clusters.
Participants in different interest group perform differently in different clusters in terms their activeness in joining social activities.
set_graph_style()
g <- ggraph(graph_aggregated,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.01, .5)) +
geom_node_point(aes(colour = AgeGroup),
size = 0.4) +
theme(legend.position = 'bottom')
g
g + facet_nodes(~AgeGroup) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
We can detect that there are also mainly 3 most significant network clusters.
Participants in different age groups perform differently in 3 clusters in terms their activeness in joining social activities.
According to whether it’s working day or weekends.
graph_work <- graph_from_data_frame (edges_byWork,
vertices = nodes_work) %>%
as_tbl_graph()
graph_work
# A tbl_graph: 834 nodes and 9986 edges
#
# A directed multigraph with 17 components
#
# Node Data: 834 x 8 (active)
name HouseholdSize HaveKids Age EducationLevel InterestGroup
<chr> <dbl> <lgl> <dbl> <chr> <chr>
1 1 3 TRUE 25 High School o~ B
2 2 3 TRUE 35 High School o~ A
3 3 3 TRUE 21 High School o~ I
4 4 3 TRUE 43 Bachelors H
5 5 3 TRUE 32 High School o~ D
6 6 3 TRUE 26 High School o~ I
# ... with 828 more rows, and 2 more variables: Joviality <dbl>,
# AgeGroup <chr>
#
# Edge Data: 9,986 x 4
from to when Weight
<int> <int> <chr> <int>
1 1 505 Working-day 127
2 1 582 Working-day 113
3 1 636 Working-day 101
# ... with 9,983 more rows
set_graph_style()
g <- ggraph(graph_work,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.01, .5)) +
geom_node_point(aes(colour = EducationLevel),
size = 0.4)
g
g + facet_edges(~when) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
g + facet_nodes(~EducationLevel) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
There are more connections - more social activities in working days.
set_graph_style()
g1 <- ggraph(graph_work,
layout = "nicely") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.01, .5)) +
geom_node_point(aes(colour = AgeGroup),
size = 0.4) +
theme(legend.position = 'bottom')
g1
g1 + facet_edges(~when) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
g1 + facet_nodes(~AgeGroup) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
What i learned: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.