This is an updated and extended version of the notebook used at the 2019 Social Networks and Health Workshop, now including (almost-)native R abilities to handle resolution parameters in modularity-like community detection and multilayer networks.
In opening, I want to acknowledge that none of this updated and extended notebook could have happened without the efforts of Vincent Traag, one of the developers of the Leiden algorithm, to bring it to R-igraph. It isn’t in R-igraph 1.2.6, but it is in the development branch and is expected to be part of 1.2.7.
We are using 1.2.6.9118 in building this notebook.
library(igraph)
Attaching package: ‘igraph’
The following objects are masked from ‘package:stats’:
decompose, spectrum
The following object is masked from ‘package:base’:
union
help(package="igraph")
The help window should now display the igraph package version.
If you are using 1.2.7, the notebook below should work. If you are using 1.2.6 or earlier, you will need to update: at the time of writing, the cluster_leiden() function is only available in the development branch (see my slides for a quick-hit installation overview).
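For reference, one common route to the development branch is sketched below. This assumes you have the remotes package and a working build toolchain (e.g., Rtools on Windows); your setup may differ.

```r
# Install the development version of R-igraph from GitHub.
# Building from source requires a C/C++ toolchain.
install.packages("remotes")               # if not already installed
remotes::install_github("igraph/rigraph")
packageVersion("igraph")                  # check which version you now have
```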
You can follow the discussion about what parts of the C-igraph library are ported to R-igraph, as well as other options for calling Leiden from R, at https://github.com/vtraag/leidenalg/issues/59 and https://github.com/TomKellyGenetics/leiden/issues/1. (Continued thanks to Tom Kelly for giving us the reticulate-enabled package we called last time around in 2019.)
We also gratefully acknowledge that many parts of this notebook are informed by Statistical Analysis of Network Data with R, especially Section 4.4 (and see also Section 6.3 about stochastic block models, not covered here today).
Again, community detection is a huge subfield of network science. If you are interested in the topic in general, please consult one of the many review articles available.
In the examples in our 2019 lecture and notebook, we made a repeated point about the apparent absence of native-to-R implementation of the Reichardt-Bornholdt “gamma” resolution parameter for modularity.
The (more) careful (than me) reader of the documentation might notice that one option was available there all along: the cluster_spinglass() function does indeed have a gamma parameter, and even a gamma.minus for handling negative weights. I encourage you to try it on your problem. My intuition is that the simulated annealing approach used in this algorithm is often much slower than other available routines, but it is another good function to know about and have available for comparison.
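If you want to experiment with that route, a minimal sketch might look like the following, using igraph’s built-in copy of the Zachary karate club; gamma = 0.5 here is just an illustrative value, not a recommendation.

```r
library(igraph)
g <- make_graph("Zachary")                # built-in unweighted karate club
set.seed(42)                              # cluster_spinglass() is stochastic
ks <- cluster_spinglass(g, gamma = 0.5)   # smaller gamma favors coarser partitions
length(ks)                                # number of communities found
sizes(ks)
```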
For the purposes of this notebook, we will focus on the new cluster_leiden().
?cluster_leiden
Note in particular that cluster_leiden() uses the constant Potts model (CPM) by default. I’ll admit that I’m biased by my own trajectory in that I’m going to focus on using modularity instead of CPM here. I do not believe one is necessarily “better” than the other. They have different behaviors, especially in the presence of increased variance in the (possibly weighted) degree distribution, and which behavior you want might reasonably and understandably vary with the problem you want to address. Keep in mind throughout that, at the end of the day, these are all just different ways of clustering network data.
“If your method doesn’t work on this network, then go home.”
– Eric Kolaczyk, at the opening workshop of the 2010-11 SAMSI program year on Complex Networks
library(igraphdata)
data(karate)
print_all(karate)
IGRAPH 4b458a1 UNW- 34 78 -- Zachary's karate club network
+ attr: name (g/c), Citation (g/c), Author (g/c), Faction (v/n), name (v/c),
| label (v/c), color (v/n), weight (e/n)
+ edges (vertex names):
Mr Hi -- Actor 2, Actor 3, Actor 4, Actor 5, Actor 6, Actor 7, Actor 8, Actor 9,
Actor 11, Actor 12, Actor 13, Actor 14, Actor 18, Actor 20, Actor
22, Actor 32
Actor 2 -- Mr Hi, Actor 3, Actor 4, Actor 8, Actor 14, Actor 18, Actor 20, Actor
22, Actor 31
Actor 3 -- Mr Hi, Actor 2, Actor 4, Actor 8, Actor 9, Actor 10, Actor 14, Actor
28, Actor 29, Actor 33
Actor 4 -- Mr Hi, Actor 2, Actor 3, Actor 8, Actor 13, Actor 14
Actor 5 -- Mr Hi, Actor 7, Actor 11
Actor 6 -- Mr Hi, Actor 7, Actor 11, Actor 17
Actor 7 -- Mr Hi, Actor 5, Actor 6, Actor 17
Actor 8 -- Mr Hi, Actor 2, Actor 3, Actor 4
Actor 9 -- Mr Hi, Actor 3, Actor 31, Actor 33, John A
Actor 10 -- Actor 3, John A
Actor 11 -- Mr Hi, Actor 5, Actor 6
Actor 12 -- Mr Hi
Actor 13 -- Mr Hi, Actor 4
Actor 14 -- Mr Hi, Actor 2, Actor 3, Actor 4, John A
Actor 15 -- Actor 33, John A
Actor 16 -- Actor 33, John A
Actor 17 -- Actor 6, Actor 7
Actor 18 -- Mr Hi, Actor 2
Actor 19 -- Actor 33, John A
Actor 20 -- Mr Hi, Actor 2, John A
Actor 21 -- Actor 33, John A
Actor 22 -- Mr Hi, Actor 2
Actor 23 -- Actor 33, John A
Actor 24 -- Actor 26, Actor 28, Actor 30, Actor 33, John A
Actor 25 -- Actor 26, Actor 28, Actor 32
Actor 26 -- Actor 24, Actor 25, Actor 32
Actor 27 -- Actor 30, John A
Actor 28 -- Actor 3, Actor 24, Actor 25, John A
Actor 29 -- Actor 3, Actor 32, John A
Actor 30 -- Actor 24, Actor 27, Actor 33, John A
Actor 31 -- Actor 2, Actor 9, Actor 33, John A
Actor 32 -- Mr Hi, Actor 25, Actor 26, Actor 29, Actor 33, John A
Actor 33 -- Actor 3, Actor 9, Actor 15, Actor 16, Actor 19, Actor 21, Actor 23,
Actor 24, Actor 30, Actor 31, Actor 32, John A
John A -- Actor 9, Actor 10, Actor 14, Actor 15, Actor 16, Actor 19, Actor 20,
Actor 21, Actor 23, Actor 24, Actor 27, Actor 28, Actor 29, Actor
30, Actor 31, Actor 32, Actor 33
plot(karate)
If we call cluster_leiden() with default options, it uses CPM and sets resolution_parameter to 1, which isn’t a particularly good value for CPM most of the time. Vincent Traag has a nice, concise discussion of this choice, deliberately nudging users to pick a CPM resolution themselves. As he notes, a better default value would be closer to the (weighted) density of the graph; but that would then encourage people to stick with the default rather than explore the impact of changing the parameter. The default only works (sort of) here because the karate data set is weighted.
kc <- cluster_leiden(karate)
length(kc)
[1] 16
sizes(kc)
Community sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
6 2 3 6 1 1 1 1 1 1 1 1 1 5 2 1
membership(kc)
Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 Actor 7 Actor 8 Actor 9 Actor 10
1 1 1 1 2 3 3 1 4 5
Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 Actor 17 Actor 18 Actor 19 Actor 20
2 6 7 1 8 4 3 9 10 11
Actor 21 Actor 22 Actor 23 Actor 24 Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30
12 13 4 14 14 14 15 14 16 15
Actor 31 Actor 32 Actor 33 John A
4 14 4 4
plot(kc,karate)
kc.cpm1 <- kc #In case I need this later
For comparison, we compute the density and weighted density of the karate club network:
n <- length(V(karate)) #number vertices
graph.density(karate)
[1] 0.1390374
length(E(karate))/(n*(n-1)/2)
[1] 0.1390374
sum(E(karate)$weight)/(n*(n-1)/2)
[1] 0.4117647
We confirm that we get a smaller number of larger communities using Leiden with CPM at the weighted density:
kc <- cluster_leiden(karate,
resolution_parameter = sum(E(karate)$weight)/(n*(n-1)/2))
length(kc)
[1] 10
sizes(kc)
Community sizes
1 2 3 4 5 6 7 8 9 10
9 5 13 1 1 1 1 1 1 1
membership(kc)
Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 Actor 7 Actor 8 Actor 9 Actor 10
1 1 1 1 2 2 2 1 3 4
Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 Actor 17 Actor 18 Actor 19 Actor 20
2 5 1 1 3 3 2 6 7 1
Actor 21 Actor 22 Actor 23 Actor 24 Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30
8 1 9 3 3 3 3 3 10 3
Actor 31 Actor 32 Actor 33 John A
3 3 3 3
plot(kc,karate)
kc.cpmdens <- kc
We remind you that one of the most powerful things you can do to understand your community assignments, especially if the number of communities is not too big, is to look at contingency tables (also called crosstabs or confusion matrices, depending on the setting).
table(membership(kc.cpm1),membership(kc.cpmdens))
1 2 3 4 5 6 7 8 9 10
1 6 0 0 0 0 0 0 0 0 0
2 0 2 0 0 0 0 0 0 0 0
3 0 3 0 0 0 0 0 0 0 0
4 0 0 5 0 0 0 0 0 1 0
5 0 0 0 1 0 0 0 0 0 0
6 0 0 0 0 1 0 0 0 0 0
7 1 0 0 0 0 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0
9 0 0 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 1 0 0 0
11 1 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 1 0 0
13 1 0 0 0 0 0 0 0 0 0
14 0 0 5 0 0 0 0 0 0 0
15 0 0 2 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 1
Before switching over from CPM to modularity, it might be instructive to see what happens with the default cluster_leiden() options on the unweighted version of the karate club network:
karate.unweighted <- delete_edge_attr(karate,"weight")
plot(karate.unweighted)
kc <- cluster_leiden(karate.unweighted)
length(kc)
[1] 34
sizes(kc)
Community sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
31 32 33 34
1 1 1 1
membership(kc)
Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 Actor 7 Actor 8 Actor 9 Actor 10
1 2 3 4 5 6 7 8 9 10
Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 Actor 17 Actor 18 Actor 19 Actor 20
11 12 13 14 15 16 17 18 19 20
Actor 21 Actor 22 Actor 23 Actor 24 Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30
21 22 23 24 25 26 27 28 29 30
Actor 31 Actor 32 Actor 33 John A
31 32 33 34
plot(kc,karate.unweighted)
The above is the expected behavior on an unweighted network like this. Why? With resolution_parameter 1, CPM only rewards a community to the extent that its internal edge weight exceeds one per pair of vertices; with no edge weights above 1, no group of vertices can do better than the all-singletons partition.
Recall that it is now well established that modularity suffers from a resolution limit, in the sense that increasing the overall size of the network can force communities to merge into larger ones even in situations where that obviously shouldn’t happen. This is frankly just another variant of the age-old question of how to select the number of groups when clustering data: the default resolution parameter gamma=1 indirectly chooses the number of groups, while smaller (larger) gamma will tend to choose smaller (larger) numbers of groups on the same network data set. By not specifying gamma, you are still making a choice to use the default value.
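The standard illustration of the resolution limit is a ring of cliques, going back to Fortunato and Barthelemy. Here is a sketch with illustrative sizes; the clique size and count are arbitrary choices, not special values.

```r
library(igraph)
# A ring of 30 5-cliques, adjacent cliques joined by a single edge.
# Intuitively each clique is "obviously" its own community, but with this
# many cliques, optimizing modularity at gamma = 1 prefers merging
# neighboring cliques; a larger gamma tends to recover individual cliques.
nc <- 30; k <- 5
g <- disjoint_union(replicate(nc, make_full_graph(k), simplify = FALSE))
anchors <- seq(1, nc * k, by = k)          # one vertex from each clique
g <- add_edges(g, c(rbind(anchors, c(anchors[-1], anchors[1]))))
n.gamma1 <- length(cluster_leiden(g, objective_function = "modularity",
                                  n_iterations = 3, resolution_parameter = 1))
n.gamma2 <- length(cluster_leiden(g, objective_function = "modularity",
                                  n_iterations = 3, resolution_parameter = 2))
c(n.gamma1, n.gamma2)                      # typically fewer communities at gamma = 1
```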
In the approach we used in our 2019 notebook, different scales of resolution of community structure were accessed by adding self-loops to the network, as proposed by Arenas, Fernandez and Gomez.
Another fix for this is the gamma multiplicative factor on the null model introduced by Reichardt and Bornholdt, which is in my experience typically what people mean when they talk about the “resolution parameter” in modularity-based community detection. Since this is the more common approach, and since we already talked about the self-loop approach previously, the rest of this document is in terms of the gamma introduced by Reichardt and Bornholdt.
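To be concrete about what gamma does, Reichardt and Bornholdt multiply the null-model term in modularity by gamma: Q(gamma) = (1/2m) * sum_ij [A_ij - gamma * k_i * k_j / (2m)] * delta(c_i, c_j). A hand-rolled version of this quantity (a sketch using a dense adjacency matrix, so suitable only for small graphs) might look like:

```r
library(igraph)
# Generalized modularity with the Reichardt-Bornholdt resolution parameter,
# for undirected graphs, weighted or unweighted.
modularity_rb <- function(g, membership, gamma = 1) {
  w <- if ("weight" %in% edge_attr_names(g)) "weight" else NULL
  A <- as_adjacency_matrix(g, attr = w, sparse = FALSE)
  k <- rowSums(A)                     # (weighted) degrees
  m2 <- sum(A)                        # 2m: total edge weight, counted twice
  same <- outer(membership, membership, "==")
  sum((A - gamma * outer(k, k) / m2) * same) / m2
}
g <- make_graph("Zachary")
part <- membership(cluster_louvain(g))
# At gamma = 1 this should recover igraph's usual modularity()
c(modularity_rb(g, part, gamma = 1), modularity(g, part))
# Larger gamma penalizes big communities more, lowering Q for this partition
modularity_rb(g, part, gamma = 2)
```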
This of course then leads to a question about how to pick gamma. If you’re interested, take a look at the “CHAMP” paper and python code, and the more recent “ModularityPruning” python code by Ryan Gibson (manuscript in preparation), which combines CHAMP with the approach by Pamfil et al. These will all be discussed in the “beyond” part of the lecture.
What happens when we explore across different values of the gamma resolution parameter?
gamma <- seq(0.25,2,0.025)
nc <- vector("numeric",length(gamma))
for (i in 1:length(gamma)){
gc <- cluster_leiden(karate, objective_function = "modularity",
n_iterations = 3, resolution_parameter = gamma[i])
nc[i] <- length(gc)
}
plot(gamma,nc,xlab="gamma",ylab="# Communities",main="Zachary Karate Club")
I often advocate letting the data “tell” you where “good” or “robust” values of gamma might be. In this case, the above plot doesn’t do anything to make me prioritize the 2-, 3-, 4- or 5-community partitions relative to each other. Importantly, I don’t even know if we only found a single partition for each number of communities (indeed, we very likely did not). You could automate the comparison of different partitions relative to one another, but since this brings up issues related to the “and beyond” part of the lecture, I’m going to table those issues here and instead just compare a few of the Leiden results at different resolution parameters by way of contingency tables.
kc2 <- cluster_leiden(karate, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 0.5)
plot(kc2,karate)
table(V(karate)$Faction,membership(kc2))
1 2
1 16 0
2 0 18
kc3 <- cluster_leiden(karate, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 0.75)
plot(kc3,karate)
table(membership(kc2),membership(kc3))
1 2 3
1 11 5 0
2 0 0 18
kc4 <- cluster_leiden(karate, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 1.0)
plot(kc4,karate)
table(membership(kc3),membership(kc4))
1 2 3 4
1 11 0 0 0
2 0 5 0 0
3 0 0 12 6
kc5 <- cluster_leiden(karate, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 1.5)
plot(kc5,karate)
table(membership(kc4),membership(kc5))
1 2 3 4 5
1 5 6 0 0 0
2 5 0 0 0 0
3 0 0 10 0 2
4 0 0 0 6 0
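If you would rather have a single number than a full crosstab, igraph’s compare() computes scalar partition-similarity measures. For example, for the last two partitions computed above:

```r
# Variation of information: 0 means identical partitions
compare(membership(kc4), membership(kc5), method = "vi")
# Normalized mutual information: 1 means identical partitions
compare(membership(kc4), membership(kc5), method = "nmi")
```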
Where this all becomes even more useful is with larger networks. We pull up the Enron email network. We also used this data in the 2019 notebook, where we had to convert it from its original multigraph form into a weighted graph in order to add self-loop edge weights. Because we do not need the self-loop edge weights in the present notebook, we can skip that step; but cluster_leiden() only works for undirected graphs, so we still make that change here.
data(enron)
summary(enron)
IGRAPH 64ec693 D--- 184 125409 -- Enron email network
+ attr: LDC_names (g/c), LDC_desc (g/c), name (g/c), Citation (g/c), Email
| (v/c), Name (v/c), Note (v/c), Time (e/c), Reciptype (e/c), Topic (e/n),
| LDC_topic (e/n)
enron <- as.undirected(enron,mode="each")
summary(enron)
IGRAPH aa399a1 U--- 184 125409 -- Enron email network
+ attr: LDC_names (g/c), LDC_desc (g/c), name (g/c), Citation (g/c), Email
| (v/c), Name (v/c), Note (v/c), Time (e/c), Reciptype (e/c), Topic (e/n),
| LDC_topic (e/n)
We now run community detection varying the resolution parameter.
gamma <- seq(0.2,4,0.04)
nc <- vector("numeric",length(gamma))
for (i in 1:length(gamma)){
gc <- cluster_leiden(enron, objective_function = "modularity",
n_iterations = 3, resolution_parameter = gamma[i])
nc[i] <- length(gc)
}
plot(gamma,nc,xlab="gamma",ylab="# Communities",main="Enron Email")
Let’s contrast the results at gamma = 2.2 with those at gamma = 3.2:
ec22 <- cluster_leiden(enron, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 2.2)
ec32 <- cluster_leiden(enron, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 3.2)
table(membership(ec22),membership(ec32))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 21 2 0 0 0 0 0 0 0 1 6 0 2 0 0 0 0 0 0
3 0 0 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 10 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0
6 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
10 0 1 1 0 0 0 0 0 0 10 16 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 2 0 0 0 7 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
Another favorite example for demonstrating the need for a resolution parameter is the network of (then-so-called) Division IA college football games from the Fall 2000 season. This example goes back to M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002). The GML file can be downloaded (along with a bunch of other cool examples) from Mark Newman’s web page http://www-personal.umich.edu/~mejn/netdata/.
football <- read_graph("football.gml",format="gml")
print(football)
IGRAPH b961aa2 U--- 115 613 --
+ attr: id (v/n), label (v/c), value (v/n)
+ edges from b961aa2:
[1] 1-- 2 3-- 4 1-- 5 5-- 6 4-- 6 3-- 7 7-- 8 8-- 9 9--10 1--10 5--10 6--11
[13] 11--12 6--12 4--12 13--14 3--14 3--15 13--15 15--16 14--16 3--16 5--17 10--17
[25] 1--17 17--18 13--18 13--19 19--20 18--21 21--22 9--22 8--22 10--23 8--23 22--23
[37] 9--23 23--24 10--24 5--24 17--24 1--24 12--25 25--26 2--26 4--27 13--27 15--27
[49] 27--28 18--28 2--28 5--29 12--29 25--29 20--30 30--31 20--31 19--32 32--33 22--33
[61] 16--33 14--33 7--33 1--34 2--34 26--34 20--34 32--35 27--35 13--35 19--35 35--36
[73] 1--36 30--36 20--36 31--36 19--37 13--37 21--37 20--37 37--38 2--38 26--38 34--38
[85] 19--39 17--39 29--39 27--39 15--39 13--39 39--40 7--40 33--40 14--40 16--40 8--41
+ ... omitted several edges
Note that the graph includes vertex attributes for “id”, “label”, and “value”. The “id” appears to only be a re-indexing of the 115 vertices starting from 0. The “label” is the name of the university. The “value” indexes which of the conferences the university is associated with, for example
V(football)$label[V(football)$value==0]
[1] "FloridaState" "NorthCarolinaState" "Virginia" "GeorgiaTech"
[5] "Duke" "NorthCarolina" "Clemson" "WakeForest"
[9] "Maryland"
picks out the 9 universities that were then associated with the ACC and
V(football)$label[V(football)$value==3]
[1] "KansasState" "TexasTech" "Baylor" "Colorado" "Kansas"
[6] "IowaState" "Nebraska" "TexasA&M" "Oklahoma" "Texas"
[11] "Missouri" "OklahomaState"
identifies the 12 universities in the Big 12 Conference back when it actually had 12 universities. The Big 12 now has 10 and the Big 10 now has either 14 or 16 depending on whether you count the affiliates; if you don’t follow college sports, don’t worry.
What happens if we simply optimize modularity at the default gamma=1?
fc <- cluster_leiden(football, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 1.0)
plot(fc,football,vertex.label=V(football)$value)
membership(fc)
[1] 1 2 3 4 1 4 3 5 5 1 4 1 6 3 6 3 1 7 6 8 7 5 5 1 1 2 6 7
[29] 1 8 8 6 3 2 6 8 6 2 6 3 4 1 6 6 9 2 10 3 9 10 1 5 4 10 6 8
[57] 7 9 7 7 3 6 7 7 3 7 9 10 5 1 7 6 4 10 4 9 7 5 5 8 8 4 8 10
[85] 4 6 9 7 10 2 1 9 9 1 8 7 7 7 4 6 3 8 4 2 1 2 3 4 5 2 10 5
[113] 9 7 10
sizes(fc)
Community sizes
1 2 3 4 5 6 7 8 9 10
14 9 11 12 10 15 16 10 9 9
table(V(football)$value,membership(fc))
1 2 3 4 5 6 7 8 9 10
0 0 9 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 8 0 0
2 0 0 11 0 0 0 0 0 0 0
3 0 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 9 1
5 1 0 0 0 0 2 0 2 0 0
6 0 0 0 0 0 13 0 0 0 0
7 8 0 0 0 0 0 0 0 0 0
8 0 0 0 0 10 0 0 0 0 0
9 0 0 0 0 0 0 12 0 0 0
10 4 0 0 0 0 0 3 0 0 0
11 1 0 0 0 0 0 1 0 0 8
The above table shows that there is certainly strong alignment between the conference values and the community assignments, but it also has a few quirks.
As an aside, normally I would strongly advocate using cluster_optimal() for a small-enough network if I was only using gamma=1 anyway, but I run into this problem now: https://github.com/igraph/rigraph/issues/273.
Now we sweep over gamma to let the data tell us where it should be clustered:
gamma <- seq(0.25,5,0.05)
nc <- vector("numeric",length(gamma))
for (i in 1:length(gamma)){
gc <- cluster_leiden(football, objective_function = "modularity",
n_iterations = 3, resolution_parameter = gamma[i])
nc[i] <- length(gc)
}
plot(gamma,nc,xlab="gamma",ylab="# Communities",main="American College Football")
The way I read the above plot, it is all but screaming out at me that there “should” be 12 communities identified in this data, not 10.
fc12 <- cluster_leiden(football, objective_function = "modularity",
n_iterations = 3, resolution_parameter = 2.5)
plot(fc12,football,vertex.label=V(football)$value)
membership(fc12)
[1] 1 2 3 4 1 4 3 5 5 1 4 6 7 3 7 3 1 8 7 9 8 5 5 1 6 2 7 8
[29] 6 9 9 7 3 2 7 9 10 2 7 3 4 1 7 7 11 2 12 3 11 12 6 5 4 12 7 9
[57] 8 11 10 10 3 7 8 10 3 8 11 12 5 6 8 7 4 12 4 11 8 5 5 9 9 4 9 12
[85] 4 7 11 8 12 2 6 11 11 1 9 8 8 10 4 7 3 9 4 2 1 2 3 4 5 2 12 5
[113] 11 8 12
sizes(fc12)
Community sizes
1 2 3 4 5 6 7 8 9 10 11 12
8 9 11 12 10 6 14 12 10 5 9 9
table(V(football)$value,membership(fc12))
1 2 3 4 5 6 7 8 9 10 11 12
0 0 9 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 8 0 0 0
2 0 0 11 0 0 0 0 0 0 0 0 0
3 0 0 0 12 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 9 1
5 0 0 0 0 0 1 1 0 2 1 0 0
6 0 0 0 0 0 0 13 0 0 0 0 0
7 8 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 10 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 12 0 0 0 0
10 0 0 0 0 0 4 0 0 0 3 0 0
11 0 0 0 0 0 1 0 0 0 1 0 8
table(membership(fc),membership(fc12))
1 2 3 4 5 6 7 8 9 10 11 12
1 8 0 0 0 0 6 0 0 0 0 0 0
2 0 9 0 0 0 0 0 0 0 0 0 0
3 0 0 11 0 0 0 0 0 0 0 0 0
4 0 0 0 12 0 0 0 0 0 0 0 0
5 0 0 0 0 10 0 0 0 0 0 0 0
6 0 0 0 0 0 0 14 0 0 1 0 0
7 0 0 0 0 0 0 0 12 0 4 0 0
8 0 0 0 0 0 0 0 0 10 0 0 0
9 0 0 0 0 0 0 0 0 0 0 9 0
10 0 0 0 0 0 0 0 0 0 0 0 9
We can then investigate the above assignments to see if this makes more sense as 12 communities. For example,
V(football)$label[membership(fc12)==6]
[1] "NorthTexas" "ArkansasState" "BoiseState" "Idaho" "NewMexicoState"
[6] "UtahState"
appears to correctly pull out the old Big West as a separate community even though the “value” attribute doesn’t appear to flag it as such.
V(football)$value[membership(fc12)==6]
[1] 10 10 11 10 10 5
We close this part of the lecture by considering roll call similarities in the U.S. Senate for the 85th (1957-58) and 110th (2007-08) Congresses. These similarities are quantified as the fraction of bills on which two senators voted the same way (yea or nay), among all bills for which they were both present and voting. As such, these are dense, weighted networks. All of the Congressional roll call data used here was extracted from VoteView.
The weighted roll call similarity networks, read in directly here from the .csv files, are as defined in the paper by Waugh et al.
library(readr)
senate110adj <- read_csv("senate110.csv", col_names = FALSE)
── Column specification ──────────────────────────────────────────────────────────────────
cols(
.default = col_double()
)
ℹ Use `spec()` for the full column specifications.
senate110 <- graph_from_adjacency_matrix(as.matrix(senate110adj),
mode="undirected",weighted=TRUE)
senate110names <- read.delim("senate110names.txt",header=FALSE)
V(senate110)$name <- as.matrix(senate110names)
senate110
IGRAPH 50bb86d UNW- 102 5147 --
+ attr: name (v/c), weight (e/n)
+ edges from 50bb86d (vertex names):
[1] 1104970041 0ALABAMA 20001SESSIONS --1109465941 0ALABAMA 20001SHELBY
[2] 1104970041 0ALABAMA 20001SESSIONS --1104030081 0ALASKA 20001MURKOWSKI
[3] 1104970041 0ALABAMA 20001SESSIONS --1101210981 0ALASKA 20001STEVENS
[4] 1104970041 0ALABAMA 20001SESSIONS --1101542961 0ARIZONA 20001KYL
[5] 1104970041 0ALABAMA 20001SESSIONS --1101503961 0ARIZONA 20001MCCAIN
[6] 1104970041 0ALABAMA 20001SESSIONS --1104030142 0ARKANSA 10001PRYOR
[7] 1104970041 0ALABAMA 20001SESSIONS --1102930542 0ARKANSA 10001LINCOLN
[8] 1104970041 0ALABAMA 20001SESSIONS --1101501171 0CALIFOR 10001BOXER
+ ... omitted several edges
plot(senate110,vertex.label=NA,edge.width=E(senate110)$weight)
The above plot doesn’t appear to show a lot of polarization, which is perhaps surprising if one has, well, paid attention to American politics at all for the past 20 years. The shape of this network layout is largely a consequence of defining the edge weights as fractions of agreement: Senators who agree only rarely are still connected with some positive edge weight. One could certainly rescale the edge weights with sigmoid-like functions to make the resulting layout different. But since today is all about community detection, we jump straight to that instead.
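For example, one such sigmoid-like rescaling might look like the following; the steepness s and midpoint w0 are arbitrary illustrative choices, and this uses the senate110 network loaded below.

```r
# Logistic transform pushes similarity weights away from the middle,
# de-emphasizing the many weak "agree occasionally" edges in the layout
w <- E(senate110)$weight
s <- 10; w0 <- 0.5                       # illustrative steepness and midpoint
w.rescaled <- 1 / (1 + exp(-s * (w - w0)))
plot(senate110, vertex.label = NA, edge.width = 3 * w.rescaled)
```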
c110louvain <- cluster_louvain(senate110)
modularity(c110louvain)
[1] 0.184462
plot(c110louvain,senate110,vertex.label=NA,edge.width=E(senate110)$weight)
A modularity around 0.18 might not sound very polarized, but again, because the network here is defined in terms of similarities, an absolutely perfectly polarized setting would have modularity 0.5. So 0.18 is relatively high compared to that theoretical maximum, and it is high historically compared to other two-year Congresses. (Again, see the Waugh et al. paper.)
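We can sanity-check that theoretical maximum on a toy example: two equal halves that always agree internally and never agree across the divide give modularity exactly 0.5.

```r
# Two disconnected 10-cliques: perfect within-group agreement, none across.
# Each community holds half the edges and half the total degree, so
# Q = 2 * (1/2 - (1/2)^2) = 0.5.
g <- disjoint_union(make_full_graph(10), make_full_graph(10))
modularity(g, c(rep(1, 10), rep(2, 10)))   # 0.5
```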
For comparison, we check whether Louvain and Leiden find anything different:
c110 <- cluster_leiden(senate110, objective_function = "modularity")
table(membership(c110),membership(c110louvain))
1 2
1 50 0
2 0 52
Who are in these two communities?
c110$membership
[1] 1 1 1 1 1 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 2 1 1 2 1 1 1 1 1 2 1 2 2 2 2 2
[43] 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 1 1 2 2 2 1 2 2 1 1 2 2 2 1 1 1 1 2 2 1 2 2 1 1 1 2 1
[85] 1 1 1 1 1 2 2 2 1 2 2 2 2 2 2 1 1 1
V(senate110)$name[c110$membership==1]
[1] "1104970041 0ALABAMA 20001SESSIONS " "1109465941 0ALABAMA 20001SHELBY "
[3] "1104030081 0ALASKA 20001MURKOWSKI " "1101210981 0ALASKA 20001STEVENS "
[5] "1101542961 0ARIZONA 20001KYL " "1101503961 0ARIZONA 20001MCCAIN "
[7] "1102910862 0COLORAD 20001ALLARD " "1104050143 0FLORIDA 20001MARTINEZ "
[9] "1102951244 0GEORGIA 20001CHAMBLISS " "1102990944 0GEORGIA 20001ISAKSON "
[11] "1101480963 0IDAHO 20001CRAIG " "1102934563 0IDAHO 20001CRAPO "
[13] "1101450622 0INDIANA 20001LUGAR " "1101422631 0IOWA 20001GRASSLEY "
[15] "1102952332 0KANSAS 20001BROWNBACK " "1101485232 0KANSAS 20001ROBERTS "
[17] "1101540651 0KENTUCK 20001BUNNING " "1101492151 0KENTUCK 20001MCCONNELL "
[19] "1102991845 0LOUISIA 20001VITTER " "11049703 2 0MAINE 20001COLLINS "
[21] "1104030233 0MINNESO 20001COLEMAN " "1101400946 0MISSISS 20001COCHRAN "
[23] "1101403146 0MISSISS 20011LOTT " "1102953446 0MISSISS 20022WICKER "
[25] "1101550134 0MISSOUR 20001BOND " "1104970435 0NEBRASK 20001HAGEL "
[27] "1102953765 0NEVADA 20001ENSIGN " "11014826 4 0NEW HAM 20001GREGG "
[29] "11029740 4 0NEW HAM 20001SUNUNU " "1101410366 0NEW MEX 20001DOMENICI "
[31] "1102954847 0NORTH C 20001BURR " "1104030347 0NORTH C 20001DOLE "
[33] "1104990324 0OHIO 20001VOINOVICH " "1101542453 0OKLAHOM 20001INHOFE "
[35] "1102955553 0OKLAHOM 20001COBURN " "1104970572 0OREGON 20001SMITH GORD"
[37] "1101491014 0PENNSYL 20001SPECTER " "1102993648 0SOUTH C 20001DEMINT "
[39] "1102956648 0SOUTH C 20001GRAHAM " "1102975437 0SOUTH D 20001THUNE "
[41] "1104070554 0TENNESS 20001CORKER " "1104030454 0TENNESS 20001ALEXANDER "
[43] "1104030549 0TEXAS 20001CORNYN " "1104930649 0TEXAS 20001HUTCHISON "
[45] "1104930767 0UTAH 20001BENNETT " "1101450367 0UTAH 20001HATCH "
[47] "1101471240 0VIRGINI 20001WARNER " "1104970668 0WYOMING 20001ENZI "
[49] "1101563368 0WYOMING 20011THOMAS " "1104070768 0WYOMING 20022BARASSO "
V(senate110)$name[c110$membership==2]
[1] "1104030142 0ARKANSA 10001PRYOR " "1102930542 0ARKANSA 10001LINCOLN "
[3] "1101501171 0CALIFOR 10001BOXER " "1104930071 0CALIFOR 10001FEINSTEIN "
[5] "1104050062 0COLORAD 10001SALAZAR " "11014213 1 0CONNECT 10001DODD "
[7] "11015704 1 0CONNECT 10001LIEBERMAN " "1101410111 0DELAWAR 10001BIDEN "
[9] "1101501511 0DELAWAR 10001CARPER " "1101465143 0FLORIDA 10001NELSON "
[11] "1101440082 0HAWAII 10001AKAKA " "110 481282 0HAWAII 10001INOUYE "
[13] "1101502121 0ILLINOI 10001DURBIN " "1104050221 0ILLINOI 10001OBAMA "
[15] "1104990122 0INDIANA 10001BAYH " "1101423031 0IOWA 10001HARKIN "
[17] "1104970245 0LOUISIA 10001LANDRIEU " "11014661 2 0MAINE 20001SNOWE "
[19] "1101444052 0MARYLAN 10001MIKULSKI " "1101540852 0MARYLAN 10001CARDIN "
[21] "11010808 3 0MASSACH 10001KENNEDY ED" "11014920 3 0MASSACH 10001KERRY JOHN"
[23] "1102973223 0MICHIGA 10001STABENOW " "1101470923 0MICHIGA 10001LEVIN CARL"
[25] "1104070033 0MINNESO 10001KLOBUCHAR " "1104070134 0MISSOUR 10001MCCASKILL "
[27] "1101420364 0MONTANA 10001BAUCUS " "1104070264 0MONTANA 10001TESTER "
[29] "1104010335 0NEBRASK 10001NELSON BEN" "1101505465 0NEVADA 10001REID "
[31] "1102937312 0NEW JER 10001MENENDEZ " "1101491412 0NEW JER 10001LAUTENBERG "
[33] "1101491266 0NEW MEX 10001BINGAMAN " "1104010513 0NEW YOR 10001CLINTON "
[35] "1101485813 0NEW YOR 10001SCHUMER " "1101550236 0NORTH D 10001CONRAD "
[37] "1101481236 0NORTH D 10001DORGAN " "1102938924 0OHIO 10001BROWN "
[39] "1101487172 0OREGON 10001WYDEN " "1104070314 0PENNSYL 10001CASEY "
[41] "11040704 5 0RHODE I 10001WHITEHOUSE " "11029142 5 0RHODE I 10001REED "
[43] "1101542537 0SOUTH D 10001JOHNSON " "11029147 6 0VERMONT 32801SANDERS "
[45] "11014307 6 0VERMONT 10001LEAHY " "1104070640 0VIRGINI 10001WEBB "
[47] "1103931073 0WASHING 10001CANTWELL " "1104930873 0WASHING 10001MURRAY "
[49] "110 136656 0WEST VI 10001BYRD ROBER" "1101492256 0WEST VI 10001ROCKEFELLER"
[51] "1104930925 0WISCONS 10001FEINGOLD " "1101570325 0WISCONS 10001KOHL "
For contrast, here is the 85th Senate:
senate85adj <- read_csv("senate85.csv", col_names = FALSE)
── Column specification ──────────────────────────────────────────────────────────────────
cols(
.default = col_double()
)
ℹ Use `spec()` for the full column specifications.
senate85 <- graph_from_adjacency_matrix(as.matrix(senate85adj),
mode="undirected",weighted=TRUE)
senate85names <- read.delim("senate85names.txt",header=FALSE)
V(senate85)$name <- as.matrix(senate85names)
senate85
IGRAPH 9ed05a4 UNW- 101 5014 --
+ attr: name (v/c), weight (e/n)
+ edges from 9ed05a4 (vertex names):
[1] 85 876441 0ALABAMA 10001SPARKMAN --85 441841 0ALABAMA 10001HILL
[2] 85 876441 0ALABAMA 10001SPARKMAN --85 365861 0ARIZONA 20001GOLDWATER
[3] 85 876441 0ALABAMA 10001SPARKMAN --85 422761 0ARIZONA 10001HAYDEN
[4] 85 876441 0ALABAMA 10001SPARKMAN --85 615142 0ARKANSA 10001MCCLELLAN
[5] 85 876441 0ALABAMA 10001SPARKMAN --85 338842 0ARKANSA 10001FULBRIGHT
[6] 85 876441 0ALABAMA 10001SPARKMAN --85 537271 0CALIFOR 20001KUCHEL
[7] 85 876441 0ALABAMA 10001SPARKMAN --85 534371 0CALIFOR 20001KNOWLAND
[8] 85 876441 0ALABAMA 10001SPARKMAN --85 154062 0COLORAD 10001CARROLL
+ ... omitted several edges
plot(senate85,vertex.label=NA,edge.width=E(senate85)$weight)
c85louvain <- cluster_louvain(senate85)
modularity(c85louvain)
[1] 0.0912171
plot(c85louvain,senate85,vertex.label=NA,edge.width=E(senate85)$weight)
c85louvain$membership
[1] 1 1 2 1 1 1 2 2 1 2 2 2 1 2 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 1
[43] 1 1 1 1 1 1 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1
[85] 1 2 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2
V(senate85)$name[c85louvain$membership==1]
[1] "85 876441 0ALABAMA 10001SPARKMAN " "85 441841 0ALABAMA 10001HILL "
[3] "85 422761 0ARIZONA 10001HAYDEN " "85 615142 0ARKANSA 10001MCCLELLAN "
[5] "85 338842 0ARKANSA 10001FULBRIGHT " "85 154062 0COLORAD 10001CARROLL "
[7] "85 334911 0DELAWAR 10001FREAR " "85 452043 0FLORIDA 10001HOLLAND "
[9] "85 859343 0FLORIDA 10001SMATHERS " "85 813844 0GEORGIA 10001RUSSELL, R."
[11] "85 917444 0GEORGIA 10001TALMADGE " "85 172163 0IDAHO 10001CHURCH "
[13] "85 269121 0ILLINOI 10001DOUGLAS " "85 290145 0LOUISIA 10001ELLENDER "
[15] "85 576745 0LOUISIA 10001LONG, R. " "85 5180 3 0MASSACH 10001KENNEDY "
[17] "85 636623 0MICHIGA 10001MCNAMARA " "85 472833 0MINNESO 10001HUMPHREY "
[19] "85 888346 0MISSISS 10001STENNIS " "85 282246 0MISSISS 10001EASTLAND "
[21] "85 432834 0MISSOUR 10001HENNINGS " "85 914134 0MISSOUR 10001SYMINGTON "
[23] "85 682064 0MONTANA 10001MURRAY " "85 596764 0MONTANA 10001MANSFIELD "
[25] "85 594465 0NEVADA 20001MALONE " "85 68865 0NEVADA 10001BIBLE "
[27] "85 16566 0NEW MEX 10001ANDERSON " "85 167866 0NEW MEX 10001CHAVEZ "
[29] "85 489813 0NEW YOR 20001JAVITS " "85 298747 0NORTH C 10001ERVIN "
[31] "85 831347 0NORTH C 10011SCOTT " "85 507347 0NORTH C 10022JORDAN "
[33] "85 545236 0NORTH D 20001LANGER " "85 659753 0OKLAHOM 10001MONRONEY "
[35] "85 521353 0OKLAHOM 10001KERR " "85 688272 0OREGON 10001NEUBERGER "
[37] "85 673872 0OREGON 10001MORSE " "85 177614 0PENNSYL 10001CLARK, J. "
[39] "85 3783 5 0RHODE I 10001GREEN " "85 7229 5 0RHODE I 10001PASTORE "
[41] "85 500948 0SOUTH C 10001JOHNSTON " "859936948 0SOUTH C 10001THURMOND "
[43] "85 512254 0TENNESS 10001KEFAUVER " "85 370454 0TENNESS 10001GORE "
[45] "85 497949 0TEXAS 10001JOHNSON, L." "85 232649 0TEXAS 10011DANIEL "
[47] "85 77749 0TEXAS 10025BLAKLEY " "851041449 0TEXAS 10032YARBOROUGH "
[49] "85 485173 0WASHING 10001JACKSON " "85 591373 0WASHING 10001MAGNUSON "
[51] "85 685656 0WEST VI 10011NEELY " "85 763825 0WISCONS 10022PROXMIRE "
[53] "85 707068 0WYOMING 10001O'MAHONEY "
V(senate85)$name[c85louvain$membership==2]
[1] "85 365861 0ARIZONA 20001GOLDWATER " "85 537271 0CALIFOR 20001KUCHEL "
[3] "85 534371 0CALIFOR 20001KNOWLAND " "85 14262 0COLORAD 20001ALLOTT "
[5] "85 1329 1 0CONNECT 20001BUSH " "85 7656 1 0CONNECT 20001PURTELL "
[7] "851016311 0DELAWAR 20001WILLIAMS, J" "85 280363 0IDAHO 20001DWORSHAK "
[9] "85 260921 0ILLINOI 20001DIRKSEN " "85 492222 0INDIANA 20001JENNER "
[11] "85 148922 0INDIANA 20001CAPEHART " "85 438231 0IOWA 20001HICKENLOOPE"
[13] "85 603931 0IOWA 20001MARTIN, T. " "85 827032 0KANSAS 20001SCHOEPPEL "
[15] "85 151032 0KANSAS 20001CARLSON " "85 674551 0KENTUCK 20001MORTON "
[17] "85 205251 0KENTUCK 20001COOPER " "85 7271 2 0MAINE 20001PAYNE "
[19] "85 8666 2 0MAINE 20001SMITH, M. " "85 54652 0MARYLAN 20001BEALL "
[21] "85 134152 0MARYLAN 20001BUTLER " "85 8185 3 0MASSACH 20001SALTONSTALL"
[23] "85 755123 0MICHIGA 20001POTTER " "85 937433 0MINNESO 20001THYE "
[25] "85 466035 0NEBRASK 20001HRUSKA " "85 226735 0NEBRASK 20001CURTIS "
[27] "85 1027 4 0NEW HAM 20001BRIDGES " "85 2087 4 0NEW HAM 20001COTTON "
[29] "85 156912 0NEW JER 20001CASE, C. " "85 863512 0NEW JER 20001SMITH, H. "
[31] "85 483413 0NEW YOR 20001IVES " "851045036 0NORTH D 20001YOUNG, M. "
[33] "85 550024 0OHIO 10001LAUSCHE " "85 102424 0OHIO 20001BRICKER "
[35] "85 602114 0PENNSYL 20001MARTIN, E. " "85 679637 0SOUTH D 20001MUNDT "
[37] "85 157037 0SOUTH D 20001CASE, F. " "85 64567 0UTAH 20001BENNETT "
[39] "85 986567 0UTAH 20001WATKINS " "85 3206 6 0VERMONT 20001FLANDERS "
[41] "85 52 6 0VERMONT 20001AIKEN " "85 136540 0VIRGINI 10001BYRD "
[43] "85 795840 0VIRGINI 10001ROBERTSON " "85 782656 0WEST VI 20002REVERCOMB "
[45] "85 447356 0WEST VI 20025HOBLITZELL " "851011025 0WISCONS 20001WILEY "
[47] "85 613725 0WISCONS 20011MCCARTHY " "85 46468 0WYOMING 20001BARRETT "
Interestingly, Louvain and Leiden give different answers here at the default resolution:
c85 <- cluster_leiden(senate85, objective_function = "modularity")
table(membership(c85),membership(c85louvain))
1 2
1 40 0
2 0 45
3 13 3
And who is in this third community?
V(senate85)$name[c85$membership==3]
[1] "85 615142 0ARKANSA 10001MCCLELLAN " "85 334911 0DELAWAR 10001FREAR "
[3] "85 452043 0FLORIDA 10001HOLLAND " "85 813844 0GEORGIA 10001RUSSELL, R."
[5] "85 917444 0GEORGIA 10001TALMADGE " "85 290145 0LOUISIA 10001ELLENDER "
[7] "85 888346 0MISSISS 10001STENNIS " "85 282246 0MISSISS 10001EASTLAND "
[9] "85 594465 0NEVADA 20001MALONE " "85 298747 0NORTH C 10001ERVIN "
[11] "85 507347 0NORTH C 10022JORDAN " "851045036 0NORTH D 20001YOUNG, M. "
[13] "859936948 0SOUTH C 10001THURMOND " "85 232649 0TEXAS 10011DANIEL "
[15] "85 136540 0VIRGINI 10001BYRD " "85 795840 0VIRGINI 10001ROBERTSON "
So, “should” there be 2 communities or 3 communities in the 85th Senate roll call?
gamma <- seq(0.75,1.25,0.01)
nc <- vector("numeric",length(gamma))
for (i in 1:length(gamma)){
gc <- cluster_leiden(senate85, objective_function = "modularity",
n_iterations = 3, resolution_parameter = gamma[i])
nc[i] <- length(gc)
}
plot(gamma,nc,xlab="gamma",ylab="# Communities",main="85th Senate Roll Call",log="y")