1. Overview

In this project, we need to reveal the patterns of community interactions of the city of Engagement, Ohio USA by using social network analysis approach.

Processes include: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.

2. Required libraries and datasets

2.1 Load required packages

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

Here four network data modelling and visualisation packages will be installed and launched. They are igraph, tidygraph, ggraph and visNetwork. Beside these four packages, tidyverse and lubridate, an R package specially designed to handle and wrangling time data will be installed and launched too.

The chunk code on the right will do the trick.

hide

packages = c('igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'tidyverse', 'graphlayouts')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Importing Data

There are two data sets needed. One contains the nodes data and the other contains the edges (also know as link) data.

In this step, we will import .csv into RStudio environment by using read_csv() of readr package.

hide

edges <- read_csv("data/SocialNetwork.csv")
nodes <- read_csv("data/Participants.csv")

It is always a good practice to examine the imported data frame before further analysis is performed.

Let’s take an overview of the datasets, we will examine the structure of the data frame using glimpse() of dplyr.

hide

glimpse(edges)
glimpse(nodes)

3. Data Wrangling

3.1 Preprocess the Edges Data

We will aggregate the participants by from, to, date, weekday, week and the working-day/weekends.

hide

workdays <- c('Monday','Tuesday','Wednesday','Thursday','Friday')

edges <- edges %>% 
  mutate(from = participantIdFrom,
         to = participantIdTo) %>% 
  mutate(weekday = wday(timestamp,
                        label = TRUE,
                        abbr = FALSE)) %>% 
  mutate(month = month(timestamp,
                       label = FALSE)) %>%
  mutate(week = lubridate::week(timestamp)) %>%
  mutate(when = case_when(weekday %in% workdays ~ "Working-day",
                          TRUE ~ "Weekends")) %>% 
  select(from,to,timestamp,weekday,week,when)

3.2 Preprocess the Nodes Data

- Rename ‘HighSchoolOrCollege’

hide

nodes$educationLevel <- sub('HighSchoolOrCollege',
                            'High School or College',
                            nodes$educationLevel)

- Rename columns

hide

nodes <- nodes %>% 
  rename('ID' = 'participantId', 
         'HouseholdSize' = 'householdSize', 
         'HaveKids' = 'haveKids', 
         'Age' = 'age', 
         'EducationLevel' = 'educationLevel', 
         'InterestGroup' = 'interestGroup', 
         'Joviality' = 'joviality')

- Age Binning

Age variable is binned with the following code chunk:

hide

breaks <- c(17, 20, 25, 30, 35, 40, 45, 50, 55, 60)
groups <- c('20 & Below', '21-25', '26-30',
            '31-35', '36-40', '41-45',
            '46-50', '51-55', '56-60')
nodes$AgeGroup <- cut(nodes$Age, 
                      breaks=breaks, 
                      labels = groups)

- Write and Read rds Files

hide

saveRDS(edges, "data/rds/edges.rds")
saveRDS(nodes, "data/rds/nodes.rds")

hide

edges <- read_rds('data/rds/edges.rds')
nodes <- read_rds('data/rds/nodes.rds')

3.3 Network Data During Whole Period

Firstly, let’s have a look of the general social network situation during the whole investigated period.

- The Edges Data

Four functions from dplyr package are used here. They are: filter(), group(), summarise(), and ungroup().

The output data.frame is called edges_aggregated. A new field called Weight has been added in edges_aggregated.

hide

edges_aggregated <- edges %>% 
  group_by(from,to) %>% 
  summarise(Weight = n()) %>% 
  filter(from!=to) %>% 
  filter(Weight > 1) %>% 
  ungroup()
edges_aggregated

# A tibble: 160,842 x 3
    from    to Weight
   <dbl> <dbl>  <int>
 1     0    27     14
 2     0    33     16
 3     0    37     20
 4     0    53     15
 5     0    72     16
 6     0   123     17
 7     0   226     16
 8     0   300     18
 9     0   316     34
10     0   341     17
# ... with 160,832 more rows

How is the distribution of the social network?

hide

hist(edges_aggregated$Weight)

We can observe that during this period the highest frequency one socialize with others is higher than 400 and as the count of edges increases, the frequency decreases, which means most people does not like to socialize with others.

So, let’s select edges with weight higher than 200 - those participants is more actively socialized during this period.

hide

edges_aggregated <- edges_aggregated %>% 
  filter(Weight >200)

Now, we have 3 columns and 4210 records of socialization.

hide

glimpse(edges_aggregated)

Rows: 4,210
Columns: 3
$ from   <dbl> 2, 2, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6,~
$ to     <dbl> 69, 176, 29, 92, 113, 221, 882, 953, 82, 136, 203, 10~
$ Weight <int> 244, 279, 206, 355, 315, 397, 209, 383, 276, 357, 275~

- The Nodes Data

For the edges data, we need to find those participants who have been in the 4210 records we selected.

hide

nodes_aggregated <- nodes %>%
  filter (ID %in% c(edges_aggregated$from,
                    edges_aggregated$to))

3.4 Network Data in Working-days/weekends

Same functions are used here to get the network fro both working days and weekends.

- The Edges Data

hide

edges_byWork <- edges %>% 
  group_by(from,to,when) %>% 
  summarise(Weight = n()) %>% 
  filter(from!=to) %>% 
  filter(Weight > 1) %>% 
  ungroup()
edges_byWork

# A tibble: 317,900 x 4
    from    to when        Weight
   <dbl> <dbl> <chr>        <int>
 1     0    27 Weekends         4
 2     0    27 Working-day     10
 3     0    33 Weekends         4
 4     0    33 Working-day     12
 5     0    37 Weekends         5
 6     0    37 Working-day     15
 7     0    53 Weekends         5
 8     0    53 Working-day     10
 9     0    72 Weekends         4
10     0    72 Working-day     12
# ... with 317,890 more rows

Code chunk below gives the distribution of participants’ socialize situation in working-days and weekends.

hide

hist(edges_byWork$Weight)

We can observe that the frequency of socialization of majority participants is less than 100. We then have a look at those with weights more than 100.

hide

edges_byWork <- edges_byWork %>% 
  filter(Weight >100)

hide

glimpse(edges_byWork)

Rows: 9,986
Columns: 4
$ from   <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4,~
$ to     <dbl> 555, 675, 731, 69, 136, 149, 176, 202, 218, 412, 660,~
$ when   <chr> "Working-day", "Working-day", "Working-day", "Working~
$ Weight <int> 127, 113, 101, 176, 106, 102, 198, 125, 125, 103, 105~

- The Nodes Data

For the edges data, we need to find those participants who have been in the 9986 records we selected.

hide

nodes_work <- nodes %>%
  filter (ID %in% c(edges_byWork$from,
                    edges_byWork$to))

4. Visulisation and Insights

4.1 High-weightage Network During Whole Period

We will create a graph data model by using tidygraph package. It provides a tidy API for graph/network manipulation. While network data itself is not tidy, it can be envisioned as two tidy tables, one for node data and one for edge data. tidygraph provides a way to switch between the two tables and provides dplyr verbs for manipulating them. Furthermore it provides access to a lot of graph algorithms with return values that facilitate their use in a tidy workflow.

Two functions of tidygraph package can be used to create network objects, they are:

tbl_graph() creates a tbl_graph network object from nodes and edges data.
as_tbl_graph() converts network data and objects to a tbl_graph network

hide

graph_aggregated <- graph_from_data_frame (edges_aggregated,
                                           vertices = nodes_aggregated) %>%
  as_tbl_graph()

graph_aggregated

# A tbl_graph: 763 nodes and 4210 edges
#
# A directed simple graph with 24 components
#
# Node Data: 763 x 8 (active)
  name  HouseholdSize HaveKids   Age EducationLevel InterestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 2                 3 TRUE        35 High School o~ A            
2 4                 3 TRUE        43 Bachelors      H            
3 5                 3 TRUE        32 High School o~ D            
4 6                 3 TRUE        26 High School o~ I            
5 7                 3 TRUE        27 Bachelors      A            
6 8                 3 TRUE        20 Bachelors      G            
# ... with 757 more rows, and 2 more variables: Joviality <dbl>,
#   AgeGroup <chr>
#
# Edge Data: 4,210 x 3
   from    to Weight
  <int> <int>  <int>
1     1    56    244
2     1   146    279
3     2    24    206
# ... with 4,207 more rows

We will plot Network Data with ggraph package - an extension of ggplot2, making it easier to carry over basic ggplot skills to the design of network graphs.

As in all network graph, there are three main aspects to a ggraph’s network graph, they are:nodes,edges and layouts.

hide

set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = EducationLevel), 
                  size = 0.4) +
  theme(legend.position = 'bottom')

g

hide

g + facet_nodes(~EducationLevel) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there’re mainly 3 most significant network clusters.

It seems that participants with High School or College education level are more active in social activities.

hide

set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = InterestGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g

hide

g + facet_nodes(~InterestGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there are also mainly 3 most significant network clusters.

Participants in different interest group perform differently in different clusters in terms their activeness in joining social activities.

hide

set_graph_style()
g <- ggraph(graph_aggregated, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = AgeGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g

hide

g + facet_nodes(~AgeGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

We can detect that there are also mainly 3 most significant network clusters.

Participants in different age groups perform differently in 3 clusters in terms their activeness in joining social activities.

4.2 Network during Working-days and Weekends

According to whether it’s working day or weekends.

hide

graph_work <- graph_from_data_frame (edges_byWork,
                                           vertices = nodes_work) %>%
  as_tbl_graph()

graph_work

# A tbl_graph: 834 nodes and 9986 edges
#
# A directed multigraph with 17 components
#
# Node Data: 834 x 8 (active)
  name  HouseholdSize HaveKids   Age EducationLevel InterestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 1                 3 TRUE        25 High School o~ B            
2 2                 3 TRUE        35 High School o~ A            
3 3                 3 TRUE        21 High School o~ I            
4 4                 3 TRUE        43 Bachelors      H            
5 5                 3 TRUE        32 High School o~ D            
6 6                 3 TRUE        26 High School o~ I            
# ... with 828 more rows, and 2 more variables: Joviality <dbl>,
#   AgeGroup <chr>
#
# Edge Data: 9,986 x 4
   from    to when        Weight
  <int> <int> <chr>        <int>
1     1   505 Working-day    127
2     1   582 Working-day    113
3     1   636 Working-day    101
# ... with 9,983 more rows

hide

set_graph_style() 
g <- ggraph(graph_work, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = EducationLevel), 
                  size = 0.4)
g

hide

g + facet_edges(~when) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

hide

g + facet_nodes(~EducationLevel) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

There are more connections - more social activities in working days.

hide

set_graph_style()
g1 <- ggraph(graph_work, 
            layout = "nicely") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.01, .5)) +
  geom_node_point(aes(colour = AgeGroup), 
                  size = 0.4) +
  theme(legend.position = 'bottom')
g1

hide

g1 + facet_edges(~when) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

hide

g1 + facet_nodes(~AgeGroup) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

5. Conclusion

What i learned: - create graph object data frames, manipulate them using appropriate functions of dplyr, lubridate, and tidygraph, - build network graph visualisation using appropriate functions of ggraph, - compute network geometrics using tidygraph, - build advanced graph visualisation by incorporating the network geometrics, and - build interactive network visualisation using visNetwork package.

Take-home Exercise 6

1. Overview

2. Required libraries and datasets

2.1 Load required packages

2.2 Importing Data

3. Data Wrangling

3.1 Preprocess the Edges Data

3.2 Preprocess the Nodes Data

3.3 Network Data During Whole Period

3.4 Network Data in Working-days/weekends

4. Visulisation and Insights

4.1 High-weightage Network During Whole Period

4.1.1 Social Network between Different Education Level

4.1.2 Social Network between Different Interest Group

4.1.3 Social Network between Different Age Group

4.2 Network during Working-days and Weekends

4.2.1 Social Network- Working/weekends & EducationLevel

4.2.2 Social Network - Working/Weekends & AgeGroup

5. Conclusion