Take-home Exercise 6

Explore and reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.

Huan Li https://linkedin.com/in/huan-li-ab7498124/ (SMU, SCIS, Master of IT in Business)https://scis.smu.edu.sg/master-it-business/about-mitb-main
2022-06-06

1. Overview

In this project, we need to reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.

Processes include: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.

2. Required libraries and datasets

2.1 Load required packages

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

Here four network data modelling and visualisation packages will be installed and launched. They are igraph, tidygraph, ggraph and visNetwork. Beside these four packages, tidyverse and lubridate, an R package specially designed to handle and wrangling time data will be installed and launched too.

The chunk code on the right will do the trick.

hide
packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'tidyverse', 'graphlayouts')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Importing Data

There are two data sets needed. One contains the nodes data and the other contains the edges (also know as link) data.

In this step, we will import .csv into RStudio environment by using read_csv() of readr package.

hide
edges <- read_csv("data/SocialNetwork.csv")
nodes <- read_csv("data/Participants.csv")

It is always a good practice to examine the imported data frame before further analysis is performed.

Let’s take an overview of the datasets, we will examine the structure of the data frame using glimpse() of dplyr.

hide
glimpse(edges)
glimpse(nodes)

3. Data Wrangling

3.1 Preprocess the Edges Data

We will aggregate the participants by from, to, date, weekday, week and the working-day/weekends.

hide
workdays <- c('Monday','Tuesday','Wednesday','Thursday','Friday')

edges <- edges %>% 
  mutate(from = participantIdFrom,
         to = participantIdTo) %>% 
  mutate(weekday = wday(timestamp,
                        label = TRUE,
                        abbr = FALSE)) %>% 
  mutate(month = month(timestamp,
                       label = FALSE)) %>%
  mutate(week = lubridate::week(timestamp)) %>%
  mutate(when = case_when(weekday %in% workdays ~ "Working-day",
                          TRUE ~ "Weekends")) %>% 
  select(from,to,timestamp,weekday,week,when)

3.2 Preprocess the Nodes Data

- Rename ‘HighSchoolOrCollege’

hide
nodes$educationLevel <- sub('HighSchoolOrCollege',
                            'High School or College',
                            nodes$educationLevel)

- Rename columns

hide
nodes <- nodes %>% 
  rename('ID' = 'participantId', 
         'HouseholdSize' = 'householdSize', 
         'HaveKids' = 'haveKids', 
         'Age' = 'age', 
         'EducationLevel' = 'educationLevel', 
         'InterestGroup' = 'interestGroup', 
         'Joviality' = 'joviality')

- Age Binning

Age variable is binned with the following code chunk:

hide
breaks <- c(17, 20, 25, 30, 35, 40, 45, 50, 55, 60)
groups <- c('20 & Below', '21-25', '26-30',
            '31-35', '36-40', '41-45',
            '46-50', '51-55', '56-60')
nodes$AgeGroup <- cut(nodes$Age, 
                      breaks=breaks, 
                      labels = groups)

- Write and Read rds Files

hide
saveRDS(edges, "data/rds/edges.rds")
saveRDS(nodes, "data/rds/nodes.rds")
hide
edges <- read_rds('data/rds/edges.rds')
nodes <- read_rds('data/rds/nodes.rds')

3.3 Network Data During Whole Period

Firstly, let’s have a look of the general social network situation during the whole investigated period.

- The Edges Data

Four functions from dplyr package are used here. They are: filter(), group(), summarise(), and ungroup().

The output data.frame is called edges_aggregated. A new field called Weight has been added in edges_aggregated.

hide
edges_aggregated <- edges %>% 
  group_by(from,to) %>% 
  summarise(Weight = n()) %>% 
  filter(from!=to) %>% 
  filter(Weight > 1) %>% 
  ungroup()
edges_aggregated
# A tibble: 160,842 x 3
    from    to Weight
   <dbl> <dbl>  <int>
 1     0    27     14
 2     0    33     16
 3     0    37     20
 4     0    53     15
 5     0    72     16
 6     0   123     17
 7     0   226     16
 8     0   300     18
 9     0   316     34
10     0   341     17
# ... with 160,832 more rows

How is the distribution of the social network?

hide
hist(edges_aggregated$Weight)

We can observe that during this period the highest frequency one socialize with others is higher than 400 and as the count of edges increases, the frequency decreases, which means most people does not like to socialize with others.

So, let’s select edges with weight higher than 200 - those participants is more actively socialized during this period.

hide
edges_aggregated <- edges_aggregated %>% 
  filter(Weight >200)

Now, we have 3 columns and 4210 records of socialization.

hide
glimpse(edges_aggregated)
Rows: 4,210
Columns: 3
$ from   <dbl> 2, 2, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6,~
$ to     <dbl> 69, 176, 29, 92, 113, 221, 882, 953, 82, 136, 203, 10~
$ Weight <int> 244, 279, 206, 355, 315, 397, 209, 383, 276, 357, 275~

- The Nodes Data

For the edges data, we need to find those participants who have been in the 4210 records we selected.

hide
nodes_aggregated <- nodes %>%
  filter (ID %in% c(edges_aggregated$from,
                    edges_aggregated$to))

3.4 Network Data in Working-days/weekends

Same functions are used here to get the network fro both working days and weekends.

- The Edges Data

hide
edges_byWork <- edges %>% 
  group_by(from,to,when) %>% 
  summarise(Weight = n()) %>% 
  filter(from!=to) %>% 
  filter(Weight > 1) %>% 
  ungroup()
edges_byWork
# A tibble: 317,900 x 4
    from    to when        Weight
   <dbl> <dbl> <chr>        <int>
 1     0    27 Weekends         4
 2     0    27 Working-day     10
 3     0    33 Weekends         4
 4     0    33 Working-day     12
 5     0    37 Weekends         5
 6     0    37 Working-day     15
 7     0    53 Weekends         5
 8     0    53 Working-day     10
 9     0    72 Weekends         4
10     0    72 Working-day     12
# ... with 317,890 more rows

Code chunk below gives the distribution of participants’ socialize situation in working-days and weekends.

hide
hist(edges_byWork$Weight)

We can observe that the frequency of socialization of majority participants is less than 100. We then have a look at those with weights more than 100.

hide
edges_byWork <- edges_byWork %>% 
  filter(Weight >100)
hide
glimpse(edges_byWork)
Rows: 9,986
Columns: 4
$ from   <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4,~
$ to     <dbl> 555, 675, 731, 69, 136, 149, 176, 202, 218, 412, 660,~
$ when   <chr> "Working-day", "Working-day", "Working-day", "Working~
$ Weight <int> 127, 113, 101, 176, 106, 102, 198, 125, 125, 103, 105~

- The Nodes Data

For the edges data, we need to find those participants who have been in the 9986 records we selected.

hide
nodes_work <- nodes %>%
  filter (ID %in% c(edges_byWork$from,
                    edges_byWork$to))

4. Visulisation and Insights

4.1 High-weightage Network During Whole Period

We will create a graph data model by using tidygraph package. It provides a tidy API for graph/network manipulation. While network data itself is not tidy, it can be envisioned as two tidy tables, one for node data and one for edge data. tidygraph provides a way to switch between the two tables and provides dplyr verbs for manipulating them. Furthermore it provides access to a lot of graph algorithms with return values that facilitate their use in a tidy workflow.

Two functions of tidygraph package can be used to create network objects, they are:

hide
graph_aggregated <- graph_from_data_frame (edges_aggregated,
                                           vertices = nodes_aggregated) %>%
  as_tbl_graph()

graph_aggregated
# A tbl_graph: 763 nodes and 4210 edges
#
# A directed simple graph with 24 components
#
# Node Data: 763 x 8 (active)
  name  HouseholdSize HaveKids   Age EducationLevel InterestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 2                 3 TRUE        35 High School o~ A            
2 4                 3 TRUE        43 Bachelors      H            
3 5                 3 TRUE        32 High School o~ D            
4 6                 3 TRUE        26 High School o~ I            
5 7                 3 TRUE        27 Bachelors      A            
6 8                 3 TRUE        20 Bachelors      G            
# ... with 757 more rows, and 2 more variables: Joviality <dbl>,
#   AgeGroup <chr>
#
# Edge Data: 4,210 x 3
   from    to Weight
  <int> <int>  <int>
1     1    56    244
2     1   146    279
3     2    24    206
# ... with 4,207 more rows

4.1.1 Social Network between Different Education Level

We will plot Network Data with ggraph package - an extension of ggplot2, making it easier to carry over basic ggplot skills to the design of network graphs.

As in all network graph, there are three main aspects to a ggraph’s network graph, they are:nodes,edges and layouts.

hide
set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = EducationLevel), 
                  size = 0.4) +
  theme(legend.position = 'bottom')

g
hide
g + facet_nodes(~EducationLevel) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there’re mainly 3 most significant network clusters.

It seems that participants with High School or College education level are more active in social activities.

4.1.2 Social Network between Different Interest Group

hide
set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = InterestGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g
hide
g + facet_nodes(~InterestGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there are also mainly 3 most significant network clusters.

Participants in different interest group perform differently in different clusters in terms their activeness in joining social activities.

4.1.3 Social Network between Different Age Group

hide
set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = AgeGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g
hide
g + facet_nodes(~AgeGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there are also mainly 3 most significant network clusters.

Participants in different age groups perform differently in 3 clusters in terms their activeness in joining social activities.

4.2 Network during Working-days and Weekends

According to whether it’s working day or weekends.

hide
graph_work <- graph_from_data_frame (edges_byWork,
                                           vertices = nodes_work) %>%
  as_tbl_graph()

graph_work
# A tbl_graph: 834 nodes and 9986 edges
#
# A directed multigraph with 17 components
#
# Node Data: 834 x 8 (active)
  name  HouseholdSize HaveKids   Age EducationLevel InterestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 1                 3 TRUE        25 High School o~ B            
2 2                 3 TRUE        35 High School o~ A            
3 3                 3 TRUE        21 High School o~ I            
4 4                 3 TRUE        43 Bachelors      H            
5 5                 3 TRUE        32 High School o~ D            
6 6                 3 TRUE        26 High School o~ I            
# ... with 828 more rows, and 2 more variables: Joviality <dbl>,
#   AgeGroup <chr>
#
# Edge Data: 9,986 x 4
   from    to when        Weight
  <int> <int> <chr>        <int>
1     1   505 Working-day    127
2     1   582 Working-day    113
3     1   636 Working-day    101
# ... with 9,983 more rows

4.2.1 Social Network- Working/weekends & EducationLevel

hide
set_graph_style() 
g <- ggraph(graph_work, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = EducationLevel), 
                  size = 0.4)
g
hide
g + facet_edges(~when) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

hide
g + facet_nodes(~EducationLevel) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

There are more connections - more social activities in working days.

4.2.2 Social Network - Working/Weekends & AgeGroup

hide
set_graph_style()
g1 <- ggraph(graph_work, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = AgeGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g1
hide
g1 + facet_edges(~when) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

hide
g1 + facet_nodes(~AgeGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

5. Conclusion

What i learned: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.