/* This program is designed to cluster large networks. The program uses the RNM strategy I
outlined in my 2001 Social Networks paper (http://www.sociology.ohio-state.edu/jwm/peergroups.pdf)
but adds a routine for adjusting the cluster membership to maximize the modularity score.
The "jiggle" routine essentially passes through the initial cluster solution to see if nodes send
more ties to another group. IF they do, then their group assignment is changed. Groups are then
checked to see if (a) two groups should be merged into 1 or (b) one group should be split into
2. The program has some basic "convergence" checks that speed things up (so it won't look at making
the same change twice, for example), but basically you have to tell in how long to run.
The program will result in an assignment of nodes to groups that are mutually exclusive and exaustive.
There is a special group (number 888) that is reserved for nodes that are "between" groups -- in the
sense that they send ties to multiple different groups.
While you specify a starting number of clusters, the cluster merge / split routine means that the
program will arrive at it's own determination of the number of groups.
The program is pretty robust to starting conditions, though if your network is small enough that you
can take the time to run it multiple times, it's always good to check and see if it converges on the
same solution each time.
Intput: The program takes two types of input files:
Network: The network specified as an EDGE LIST, in the form of Node1 Node2 Value (Value is not currently
used, but it's simple to weight RNM by tie value)
Coordinates: I use the XYZ coordinates from a PAJEK layout to help the initial cluster. You could use
any 3 variables you think would help, or edit the program to remove this section.
Output: The program creates two new datasets.
Partition Results: The partition results are APPENDED TO YOUR INPUT SAS ADJACENCY LIST with the variable
name you give. So, in the example below, I call the input dataset "adjlist" and save the clustering
variable as "Cluster". I don't typically like to write over datasets like this, but it's needed
to get the efficiency gains for very large networks. A set of POS variables are also saved. These
are the variables used to run the RNM cluster. I also include "oldclst", which is just the cluster
the node was assigned to in the last itteration of the program. This is really for internal use, but
if you find nodes where the final cluster differs from oldclst, it might be a marginal case worth looking
at by hand.
Summary Statistics: A dataset called TOTSUM will have 1 observation for every global run of the
program (number of cases = NL). The dataset contains summary information, such as all of the
parameter settings, the resulting number of groups, number of nodes put in the 888 category,
final modularity score (with and without 888 nodes) and so forth. This is useful to see
if the results change much due to random start information. Typlically they don't (which
is good).
Screen Output: The program writes progress to the log (just telling you what node it's made it through
in the itteration processes), so you know that it hasn't frozen if it takes awhile. The output
file contains information on the merging & splitting process. For group merges, it includes a
description of the range of possible merge values, this is useful for adjusting the mrglev paramters.
For group splits, it shows the 2x2 mixing table whenever it thinks one group should be split and
shows what that looks like with some summary stats. It then just gives a final frequency distribution
for the number of groups.
Running the Program.
The top 3/4 or so of this program you shouldn't have to modify. These are the basic macro subroutines that
%largeclst calls. Scroll down to "MODIFY BELOW HERE" to see what to change.
You invoke the program with the %largeclst statement. This macro has a number of arguments described below, I
have found the values in the example below to work pretty well, so you might start there.
nl: Number of loops. This is the number of times to run the entire routine, to see how stable the results
are for multple runs. This determines the number of cases in the summary dataset.
p: Proportion of ties that must fall within group. I usually use 0.51, higher numbers will put more nodes
in the "between group" category.
maxfcl: Maximum number of FASTCLUS clusters. For very large networks (>10,000), you can use FASTCLUS to get a
first cut on the network to reduce it to some smaller number to give Ward's min variance a headstart.
I usually set this to be pretty large. If you set it to 0, the program skips the FASTCLUS routine
and goes straight to the Wards min var clustering.
ncl: Initial number of clusters to ask for w. Wards clustersing.
mrglev: Difference in Modularity score needed to merge two groups into 1. Two groups will be merged into 1 if
doing so improves the modularty score by more than MRGLEV. In practice, attempts to "maximize" the
modularity score by setting this to 0 will usually result in a very small number of groups that, to my
thinking anyway, don't really capture peer groups. Descriptive stats sent to the output will
help you gauge this value.
maxdeg: The largest observed degree in the network. You shouldn't have to change this.
alt: Root name for the alters in the adj list dataset. Again, if you use the structure here you
should not need to change this, but if you read in a different Adj. list dataset, you can.
clstvar: Name for the cluster variable (I use "cluster")
adjdat: Name for the adjacency list dataset
crddat: Name for the coordinate (XYZ) dataset
idvar: Name for the node ID variable
mingn: Smallest allowable group size. Any group smaller than this will be put in the '888' category.
njig: Number of jiggle passes through the network -- program will go fewer if no changes are made
jigit: Number of distinct jiggle itterations (new starting points). Total itterations is product of these
two parameters.
Comments on this example: This example is about 1500 nodes, with clear clusters embedded in a large field of
random ties (the nodes are journals and the ties are co-citations. See: for more detail). In this
example I 3 fresh starts from scratch, 5 "jiggle" passes and "5 within-jiggle" itterations. It
goes quite quickly (total runtime is 5 mins for running it 3 times through, so about 1.6 mins to run
once). It finds about 27 groups, with the modularity scores in the high .70s.
To run yours, it's probably simplest to just modify this example to read your data...
ONLY MODIFY THE CODE THAT FOLLOWS "MODIFY BELOW HERE" below...
Author: Moody
Date: Sep 2005
Modified: Dec 205
*/
/* grpsplt looks to see if a cluster should be split into two groups.
The routine uses CONCOR to break the groups into two, and then looks at the
relative size and ratio of in to outgroup ties.
The routine runs after the first itteration of Jiggle has had a little time
to sort things out.
*/
%macro grpsplt (indat, alt, idvar, clstvar, maxdeg, grpmin, pminv, minrat);
* options nonotes;
data &indat; /* initialize components to zero */
set &indat;
keep &idvar pos1 - pos10 x y z &alt.1 -- &alt.&maxdeg &clstvar egosim _NBRCHNG parts;
parts=.;
run;
proc iml symsize=30000;
%include "&moddir.concor.mod";
%include "&moddir.mixmat.mod";
%include "&moddir.modmix.mod";
%include "&moddir.comps.mod";
file log;
edit &indat;
read all var{&clstvar} into clusters;
clusters=unique(clusters); /* unique set of clusters */
nsplit=0;
grpsplts=0;
do c=1 to ncol(clusters); /* do over each cluster */
cc=clusters[c];
put "GRPSPLT - Checking cluster:" cc;
read all var("&idvar")into snd where(&clstvar=cc);
if (nrow(snd) > 2.25*&grpmin & cc < 888) then do; /* check to see if it needs split */
read all var("&alt.1":"&alt.&maxdeg")into alts where(&clstvar=cc);
deg=(alts^=.)[,+]; /* save some space by chopping off extra vars, might not be needed */
maxdeg=deg[<>];
*put maxdeg;
alts=alts[,1:maxdeg];
allalts=setdif(alts,.);
allalts=union(allalts,snd);
sndloc=j(1,nrow(snd),0);
do i=1 to nrow(snd);
sndloc[i]=loc(allalts=snd[i]);
end;
/* construct an asymetric adjacency matrix of ties to all others (in and out group) */
adjmat=j(ncol(allalts),nrow(snd),0); /* Na by Ns matrix, ij=1 if group member j nominates person i */
/* fill the adj matrix */
do i=1 to nrow(alts);
degi=(alts[i,]^=.)[+];
do j=1 to degi;
altj=loc(allalts=alts[i,j]);
adjmat[altj,i]=1;
end;
end;
/* now get the correlation matrix. This is what CONCORR will run off of.
will be Ns x Ns */
cormat=corr(adjmat);
/* run concor */
ccpart = concor(cormat ,0.05,1); /* will always split in 2 blocks */
/* now we want to evalute the split a little do this
by looking at (a) the size of the splits and (b) the
mixing matrix of internal ties, and (c) the average
correlation values within blocks */
parts=ccpart[,2];
uparts=unique(parts);
sndsnd=adjmat[sndloc,]; /* square internal adj matrix used for mixing matrix */
sndmix=mixmat(sndsnd,parts);
sndpos=sndmix[ncol(sndmix)+1:nrow(sndmix),];
sndmix=sndmix[1:ncol(sndmix),];
grpsize=j(nrow(sndmix),1,0);
do i=1 to nrow(grpsize);
iloc=loc(parts=uparts[i]);
grpsize[i]=ncol(iloc);
end;
/* look at the proportion of ties that fall in-group.
if this is >= to the pminval, then treat it as a
distinct group. Else, merge it with the group
where they send most of their ties */
pingrp=vecdiag(sndmix) / sndmix[,+] ; /* proportion of ties that fall in group */
xgrptpp=(sndmix[,+]-vecdiag(sndmix))/grpsize; /* avg num ties across grp per snd node */
rat_io = vecdiag(sndmix) / (sndmix[,+]-vecdiag(sndmix));
isagrp=(pingrp > &pminv)[+] + (grpsize > &grpmin)[+] + (rat_io > &minrat)[+];
if isagrp = 6 then do; /* meets all of the group requirements, split into 2 */
print "Size, proportion ties in-Block, avg num xties, in-grp:out_grp rat, and mixing matrix for cluster:" cc;
print grpsize pingrp xgrptpp rat_io sndmix;
put "Splitting Cluster: " cc;
end;
else do;
parts=j(nrow(parts),1,1); /*keep as a single group */
end;
end; /* cluster size > 2.25 x grpmin */
else do; /* cluster too small to split or it is 888*/
parts = j(nrow(snd),1,1); /* return a single value */
end;
replace point snd var("parts");
ccparts=ncol(unique(parts)); /* total number of clusters exported */
if ccparts > 1 then do;
grpsplts=grpsplts+1;
end;
end; /* loop over each cluster */
create work.grpsplts from grpsplts [colname={"grpsplts"}];
append from grpsplts;
quit;
proc freq data=&indat noprint;
table &clstvar*parts / out=cncrgrps list; /* cncrgrps gives us the new components */
run;
data cncrgrps;
set cncrgrps;
newclst=_n_;
if count < &grpmin then newclst=888;
if &clstvar = 888 then newclst=888;
run;
proc sort data=&indat out=justclst (keep=&clstvar &idvar parts);
by &clstvar parts;
run;
data justclst;
merge justclst cncrgrps (keep=&clstvar newclst parts);
by &clstvar parts;
run;
proc sort data=justclst (keep=&idvar &clstvar newclst);
by &idvar;
run;
data &indat;
merge &indat justclst;
by &idvar;
run;
proc datasets library=work;
modify &indat;
rename &clstvar=oldclst;
rename newclst=&clstvar;
run; quit;
%mend;
/* compchk looks to see if the cluster is a connected component */
%macro compchk (indat, alt, idvar, clstvar, maxdeg, grpmin);
data &indat; /* initialize components to zero */
set &indat;
keep &idvar pos1 - pos10 x y z &alt.1 -- &alt.&maxdeg &clstvar egosim _NBRCHNG c_comp;
c_comp=.;
run;
options nonotes;
proc iml symsize=30000;
%include "&moddir.comps.mod";
file log;
edit &indat;
read all var{&clstvar} into clusters;
clusters=unique(clusters); /* unique set of clusters */
nsplit=0;
do c=1 to ncol(clusters); /* do over each cluster */
cc=clusters[c];
put "COMPCHK - Checking cluster:" cc;
if cc < 888 then do;
read all var("&alt.1":"&alt.&maxdeg")into alts where(&clstvar=cc);
read all var("&idvar")into snd where(&clstvar=cc);
deg=(alts^=.)[,+]; /* save some space, might not be needed */
maxdeg=deg[<>];
*put maxdeg;
alts=alts[,1:maxdeg];
/* remove ties to non-cluster members */
comp=j(nrow(alts),ncol(alts),0);
do i=1 to nrow(snd);
comp=comp+(alts=snd[i]);
end;
alts=alts#comp;
deg=(alts>0)[,+];
maxdeg=deg[<>];
*put maxdeg;
if maxdeg > 0 then do;
newalt=j(nrow(alts),maxdeg,.);
renum=(1:nrow(alts))`;
do i=1 to nrow(alts);
alti=setdif(alts[i,],{0 .});
*print alti;
do j=1 to deg[i];
newalt[i,j]=loc(snd=alti[j]);
end;
end;
newalt=renum||newalt;
c_comp=comps(newalt);
replace point snd var("c_comp"); /* update the c_comp value */
if ncol(unique(c_comp))> 1 then do;
put" found multiple components";
nsplit=nsplit+1;
end;
end;
else do; /* could only mean that there are no internal ties */
c_comp=1:nrow(snd);
c_comp=c_comp`;
replace point snd var("c_comp"); /* update the c_comp value */
end;
end; /* cc < 888 */
else do; /* just update grp 888 */
c_comp=j(nrow(snd),1,1);
replace point snd var("c_comp"); /* update the c_comp value */
end;
end;
create work.nsplit from nsplit [colname={"nsplit"}];
append from nsplit;
quit;
options notes;
proc freq data=&indat noprint;
table &clstvar*c_comp / out=clstcomp; /* clstcomp gives us the new components */
run;
data clstcomp;
set clstcomp;
newclst=_n_;
if count < &grpmin then newclst=888;
if &clstvar = 888 then newclst=888;
run;
proc sort data=adjlist out=justclst (keep=&clstvar &idvar c_comp);
by &clstvar c_comp;
run;
data justclst;
merge justclst clstcomp (keep=&clstvar newclst c_comp);
by &clstvar c_comp;
run;
proc sort data=justclst (keep=&idvar &clstvar newclst);
by &idvar;
run;
data &indat;
merge &indat justclst;
by &idvar;
run;
proc datasets library=work;
modify &indat;
rename &clstvar=oldclst;
rename newclst=&clstvar;
run; quit;
%mend;
/* Jiggle is a macro that adjusts node membership based on
where the majority (pval) of their ties fall. I have modified it
to also disolve small groups into the 888 category, ensure that each cluster
is at least a connected component, and split clusters that might be
two clusters. */
%macro jiggle(indat, snd, alt, clstvar, pval, grpmin, itr, jigpass, g, maxdeg, siglev);
%do j=1 %to &jigpass; /* number of times to do the full jiggle/trim set */
%put "Starting JIGGLE Iteration: &j";
data &indat;
set &indat;
_nbrchng=1; /* create a control var, tells us where a jiggle could possibly make a difference */
%if &j = 1 %then %do;
egosim=.; /* records the proportion of ego's nbrs that are in ego's grp */
%end;
run;
/* first make sure that clusters are connected components.
Do this by looping through each one. This process also
renames the clusters, but that's OK. Should probably
do this at the end of every loop, but I think it's
expensive, so I'll start by just doing it at the top of every Jiggle itteration.. */
%if &j > 1 %then %do;
%compchk(&indat,
&alt,
&idvar,
&clstvar,
&maxdeg,
&grpmin);
%end;
%do k=1 %to &itr;
proc iml symsize=30000;
options nonotes;
start propcnt(mat);
set=unique(mat);
setsame=j(1,ncol(set),0);
do i=1 to ncol(set);
setsame[i]=(mat=set[i])[:];
end;
propcnt=set//setsame;
return(propcnt);
finish;
file log;
edit &indat;
put "Jiggle Iteration: " &k " of " &itr;
changed=0;
do i=1 to &g; /* do over each person in the network */
if mod(i,500)=0 then do;
put " Node:" i;
end;
read point i var("_nbrchng") into _nbrchng;
if _nbrchng > 0 then do; /* first time through this is everyone */
*put i; /* just testing the loop structure */
read point i var("&alt.1":"&alt.&maxdeg") into alts; /* people ego i is tied to */
read point i var("&clstvar") into &clstvar;
alts=setdif(alts,.);
if type(alts)='N' then do; /* the are tied */
read point alts var("&clstvar") into altclust; /* clusters that ego's nbrs blng to */
egosim=(altclust=&clstvar)[:]; /* proportion of neighbors in same group as ego */
if egosim < &pval then do; /* most of ego's nbrs are not in ego's cluster */
ac_dist=propcnt(altclust); /* distribution of ego's neighbors' clusters */
if any(ac_dist[2,]>&pval) then do; /* there is a single cluster with > p ties */
&clstvar=ac_dist[1,loc(ac_dist[2,]>&pval)]; /* assign ego to that cluster */
_nbrchng=2;
changed=changed+1;
egosim=(altclust=&clstvar)[:];
replace point i var({"&clstvar" "_nbrchng" "egosim"});
end;
else if egosim=0 then do; /* they have no ties to anyone in their assigned group */
altmx=ac_dist[2,<>];
altmx=loc(ac_dist[2,]=altmx); /* if there is more than 1, then this is somewhat arbitrary */
if ncol(altmx)> 1 then do;
uni=j(1,ncol(altmx),0);
uni=ranuni(uni);
altmx=loc(uni=uni[<>]);
end;
&clstvar=ac_dist[1,altmx];
_nbrchng=2;
changed=changed+1;
egosim=(altclust=&clstvar)[:];
replace point i var({"&clstvar" "_nbrchng" "egosim"});
end;
free ac_dist;
end; /* ego sim */
else do; /* ego has more ties in group than pv */
replace point i var({"egosim"}); /*update the egosim for future use */
end;
free altclust egosim;
end; /* ego has alters */
else do; /*ego has no alters, means isolated, should not happen */
&clstvar = 888;
replace point alts var("_nbrchng");
changed=changed+1;
end;
free alts &clstvar;
end; /* nbrchng loop */
end; /* loop over g */
put "Number of nodes moved:" changed;
create work.changed from changed;
append from changed;
*options notes;
quit;
data changed;
set changed;
call symput('changed',col1);
run;
%if &changed=0 %then %do; /* have been no changes, so stop looking for new ones */
%let k=&itr;
%end;
%end; /* end Jiggle iterations. &indat now has the updated cluster values */
options notes;
/* now look at size of groups and remove any that are below grpmin */
proc freq data=&indat noprint;
tables &clstvar/ out=_clfrq;
run;
proc sort data=&indat;
by &clstvar;
run;
data &indat;
merge &indat _clfrq;
by &clstvar;
run;
data &indat;
set &indat;
if count < &grpmin then &clstvar = 888;
run;
proc sort data=&indat;
by snd;
run;
proc freq data=&indat noprint;
tables &clstvar/ out=_clfrq;
run;
/* now look to see if any groups can be merged. Might want to only run this
at the last few times through... */
options nonotes;
proc iml symsize=30000;
%include "&moddir.symet.mod";
%include "&moddir.modmix.mod";
%include "&moddir.grpmodmrg.mod";
%include "&moddir.gmm_recode.mod";
file log;
use work._clfrq;
read all var{cluster} into clustset;
cmat=j(nrow(clustset),nrow(clustset),0); /* create a blank mixing matrix */
edit &indat;
/* build the mixing matrix */
do i=1 to # /* do over each person in the network */
read point i var("&alt.1":"&alt.&maxdeg") into alts; /* people ego i is tied to */
read point i var("&clstvar") into egoclst; /* ego's cluster */
iloc=loc(clustset=egoclst);
alts=setdif(alts,.); /* nonmissing alters */
if type(alts)='N' then do; /* the are not isolated */
read point alts var("&clstvar") into altclust; /* clusters that ego's nbrs blng to */
do j=1 to ncol(alts);
jloc=loc(clustset=altclust[j]);
cmat[iloc,jloc]=cmat[iloc,jloc]+1;
end;
end; /* nonisolated */
end; /* loop over nodes */
mattrib cmat format=3.0;
mod_tot=modmix(cmat);
ctrim=cmat[1:nrow(cmat)-1,1:nrow(cmat)-1];
mod_trim=modmix(ctrim);
/* now look for mergers */
mergcheck=grpmodmrg(ctrim,&siglev,clustset);
if mergcheck[+] > 0 then do; /* there are groups to merge */
cchange=((1:nrow(ctrim))`)||j(nrow(ctrim),1,0);
do i=1 to nrow(cchange);
chloc=loc(mergcheck[,4]=cchange[i,1]);
if type(chloc)='N' then do;
cchange[i,2]=mergcheck[chloc[1],3];
end;
else do;
cchange[i,2]=cchange[i,1];
end;
end;
newrows=gmm_recode(cchange);
grpchng=nrow(mergcheck);
*print cchange newrows;
do i=1 to # /* do over each person in the network */
read point i var("&clstvar") into origc; /* ego's cluster */
ci=loc(clustset=origc);
if ci <= nrow(cchange) then do;
&clstvar = newrows[ci];
file log;
*put origc &clstvar;
replace point i var("&clstvar");
end;
end;
end; /* mergcheck>0 */
else do; /* no groups to merge at this siglevel */
grpchng=0;
end;
create work.grpchng from grpchng [colname={"Grpchng"}];
append from grpchng;
*options notes;
quit;
/* now look to see if any groups have been overclustered */
%if &j > 1 %then %do;
%grpsplt (&indat,
&alt,
&idvar,
&clstvar,
&maxdeg,
&grpmin,
&pval,
5); /* min ratio of in-group to out-group ties*/
%end;
%put "Ending JIGGLE Pass: &j";
/* now check to see if there is any point in continuing. If NSPLIT = 0 then no
groups have been split, if CHANGED = 0 then no person has been moved, and if
GRPCHG = 0 then no groups have been merged. At that point, no changes will
be made and we should stop. Must run at least 2 times. */
%if &j > 1 %then %do;
data changed;
set changed;
call symput('changed',col1);
run;
data nsplit;
set nsplit;
call symput('nsplit',nsplit);
run;
data grpchng;
set grpchng;
call symput('grpchng',grpchng);
run;
%put "Checking for convergence: Nsplit: &nsplit Changed: &changed Grpmerge: &grpchng ";
%if &changed = 0 & &nsplit=0 & &grpchng = 0 %then %do;
%Put "No Change in Last Iteration, Ending JIGGLE Passes ";
%let j=&jigpass;
%end;
%end;
%end; /* end jiggle passes */
* proc datasets library=work;
* delete _clfrq;
* run;
%mend;
/* Largeclst is the main program that calls the other jiggle macros as many times as you want. Useful for comparing
cluster results over different trials, since the process is based on an underlying random process results
can differ accross runs. It also makes it simple to pull all the parts together */
%macro largeclst(nl, p, maxfcl, ncl, mrglev, maxdeg, alt, clstvar, adjdat, crddat, idvar, mingn, njig, jigit);
%do l=1 %to &nl;
%numobs(&adjdat); /* now &num = number of nodes */
%include "&macdir.ranpos.mac";
%include "&macdir.posnum.mac";
%ranpos(&adjdat, /* input data set */
&alt, /* root for the alter */
snd, /* name for ego in network */
&num, /* number of cases in the network */
&maxdeg, /* maximum degree in the network */
7, /* number of iterations for the mean change */
10); /* number of positioinal vectors to calculate */
/* now run the clustering. Note the for the small test here,
I skip the fastclust step. THis probably means that the starting
clusters are better here than they would be otherwise. */
data &adjdat;
merge &adjdat &crddat;
run;
/* standardize the data */
proc standard data=&adjdat out=stand mean=0 std=1;
var pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 x y z;
run;
%if &maxfcl > 0 %then %do;
/* run fastclust */
proc fastclus data=stand out=fclust mean=fclus_mx maxc=&maxfcl maxiter=100
converge=0 noprint;
var pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 x y z;
run;
proc datasets library=work nolist;
modify fclust;
rename cluster=fclustid;
modify fclus_mx;
rename cluster=fclustid;
run;
proc cluster method=ward data=fclus_mx outtree=tree noprint;
var pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 x y z;
id fclustid;
run;
proc tree data=tree ncl=&ncl out=clust;
copy fclustid;
run;
proc sort data=clust out=clust tagsort;
by fclustid;
run;
proc sort data=fclust; /* reappend the initial ids */
by fclustid;
run;
data clust;
merge clust fclust;
by fclustid;
run;
%end;
%else %do; /* do not run fastclust, just run WARD */
proc cluster method=ward data=stand outtree=tree noprint;
var pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 x y z;
id &idvar;
run;
proc tree data=tree ncl=&ncl out=clust noprint ;
copy &idvar;
run;
%end;
proc sort data=clust;
by &idvar;
run;
data &adjdat;
merge &adjdat clust;
by &idvar;
run;
/* now adjust that set of starting clusters */
%jiggle(&adjdat, /* indat */
&idvar, /* node label name*/
&alt, /* root name for adjlist */
&clstvar, /* cluster variable */
&p, /* pval: require at least this prop of ingroup ties */
&mingn, /* smallest allowable group size */
&njig, /* number of times to run jiggle */
&jigit, /* number of times to run jig/trim => (itr * jigpass) total runs of jiggle */
&num, /* number of nodes */
&maxdeg, /* maximum degree of any node */
&mrglev); /* change in modularity that allows a group merge */
proc freq data=adjlist;
tables cluster / out=clfrq;
run;
/* get the final modularity score */
proc iml symsize=30000;
%include "&moddir.symet.mod";
%include "&moddir.modmix.mod";
file log;
use work.clfrq;
read all var{cluster} into clustset;
read all var{count} into clustcount;
cmat=j(nrow(clustset),nrow(clustset),0); /* create a blank mixing matrix */
edit work.adjlist;
do i=1 to # /* do over each person in the network */
read point i var("&alt.1":"&alt.&maxdeg") into alts; /* people ego i is tied to */
read point i var("&clstvar") into egoclst; /* ego's cluster */
iloc=loc(clustset=egoclst);
alts=setdif(alts,.); /* nonmissing alters */
if type(alts)='N' then do; /* the are not isolated */
read point alts var("&clstvar") into altclust; /* clusters that ego's nbrs blng to */
do j=1 to ncol(alts);
jloc=loc(clustset=altclust[j]);
cmat[iloc,jloc]=cmat[iloc,jloc]+1;
end;
end; /* nonisolated */
end; /* loop over nodes */
mattrib cmat format=2.0;
*print clustset cmat;
mod_tot=modmix(cmat);
ctrim=cmat[1:nrow(cmat)-1,1:nrow(cmat)-1];
mod_trim=modmix(ctrim);
print "Final Modularity Score:";
print "For all nodes:" mod_tot;
print "For only non888 nodes" mod_trim;
pval=&p;
ngrps=nrow(clustset)-1; /* don't count the 888s */
n888=clustcount[nrow(clustcount)];
outd=pval||ngrps||n888||mod_tot||mod_trim;
create work.clustsum from outd [colname={parm ngrps n888 mod_tot mod_trim}];
append from outd;
quit;
/* add the relevant parameters to the clustsum dataset */
data clustsum;
set clustsum;
Maxfclst=&maxfcl;
WardNclst=&ncl;
MergeLev=&mrglev;
MinGrpN=&mingn;
Pvalue=&p;
run;
proc append base=totsum data=clustsum;
run;
proc datasets library=work;
delete changed clfrq clust convchk copy posdat dum stand tree _clfrq clustsum clstcomp cncgrps fclust
flcus_mx grpchng grpsplts grpstat justclst nsplit;
run; quit;
/* store the cluster so we compare them across runs of the model */
%if &l = 1 %then %do;
data c_store;
set &adjdat;
keep &idvar &clstvar;
run;
%end;
%else %do;
data c_store;
merge c_store &adjdat (keep=&idvar &clstvar);
by &idvar;
run;
%end;
proc datasets library=work;
modify c_store;
rename &clstvar = &clstvar.&l;
run;
%end;
options notes;
%mend;
/******************************************/
/********* MODIFY BELOW HERE ************
Specify details of the program here. Key
elements are simply telling the program
where the data are...
******************************************/
%let macdir = c:\jwm\sas\programs\jiggle\jigfiles\; /* where you have saved your SAS modules */
%let moddir = c:\jwm\sas\programs\jiggle\jigfiles\; /* where you have saved your SAS Macros */
%let datdir = c:\jwm\presentations\ssrinets\scratch\; /* input data directory */
%let edgfile = adjis_lcelist.txt; /* simple ASCI edge list of network w 3 vars: node1 Node2 edgevalue */
%let xcrd = adjis_lcx.vec; /* X coordinates of network layout */
%let ycrd = adjis_lcy.vec; /* y coordinates of network layout */
%let zcrd = adjis_lcz.vec; /* z coordinates of network layout */
/* bring in the data */
/* numobs tells us how many cases are in a dataset */
%include "&macdir.numobs.mac";
options bufno=6;
data edges;
*infile "c:\moody\sas\programs\jigtestdat1.txt";
infile "&datdir.&edgfile";
input snd rcv val;
run;
data copy; /* need both sides of the edge for searches. So I make a copy of both i-->j and j-->i */
set edges;
rename snd=snd2;
rename rcv=snd;
run;
data copy;
set copy;
rename snd2=rcv;
run;
proc append base=edges data=copy;
run;
proc sort data=edges out=edgsrt;
by snd rcv;
run;
proc transpose data=edgsrt out=adjlist;
var rcv;
by snd;
run;
/* calculate the largest degree */
proc contents data=adjlist noprint out=adjcont;
run;
%numobs(adjcont);
%let maxdeg = %eval(&num - 2);
/* clean some stuff up */
proc datasets library=work nolist;
delete edges edgsrt adjcont;
run;
/* now add the XY coords to the position variables */
data x;
infile "&datdir.&xcrd";
input x;
run;
data y;
infile "&datdir.&ycrd";
input y;
run;
data z;
infile "&datdir.&zcrd";
input z;
run;
data crds;
merge x y z;
if x = . then delete;
y=1-y; /* reverse to match PAJEK and SAS plots */
run;
data time; /* just like to know how long the whole thing takes */
st=time();
run;
proc datasets library=work;
delete x y z;
run;
/***************************************************/
/* ******* MODIFY BELOW HERE ********************/
/* THIS IS THE MAIN PROGRAM CALL. Here is where you can set
the options specified in the header above */
/***************************************************/
%largeclst(6, /* number of times to run the entire set - looks for variability*/
0.51, /*proportion of ties that must be in-group */
0, /* number of FASTCLUST clusters, set to 0 if you want to skip FASTCLUS */
30, /* number of WARD clusters */
.003, /* group merge level. Modularity difference must be greater than this to merge groups */
&maxdeg, /* maximum degree - set internally, no need to adjust */
col, /* root name for alter var - should be no need to change */
cluster,/* name of cluster variable - you can call this what you want*/
adjlist, /* name of adj list data */
crds, /* name of coordinate data */
snd, /* name of node id variable */
5, /* min group size */
7, /* number of times to jiggle */
7); /* number of times to run jig x merge */
data time;
set time;
runtime = (time()-st)/60; /* time in minutes */
run;
proc print data=time;
run;
/* ego sim is the proprtion of alters that are in the same group as ego. Should be high, but
it's not all that useful...*/
proc univariate noprint data=adjlist;
histogram egosim / midpoints = 0 to 1 by .05 cfill=red;
inset mean="Mean" std="Std" / noframe pos=ne;
run;
/* you can write the file out to a PAJEK file pretty simply */
%numobs(adjlist);
data dum;
file "&datdir.adjis_jigclst1.clu";
put "*Vertices &num";
run;
data c_store;
set c_store;
file "&wrkdir.adjis_jigclst1.clu" mod;
put cluster4;
run;
proc freq data=c_store;
tables cluster4 / out=fnlclstfrq;
run;