Generative hypergraph clustering: from blockmodels to modularity

Hypergraphs are a natural modeling paradigm for a wide range of complex relational systems. A standard analysis task is to identify clusters of closely related or densely interconnected nodes. Many graph algorithms for this task are based on variants of the stochastic blockmodel, a random graph with flexible cluster structure. However, there are few models and algorithms for hypergraph clustering. Here, we propose a Poisson degree-corrected hypergraph stochastic blockmodel (DCHSBM), a generative model of clustered hypergraphs with heterogeneous node degrees and edge sizes. Approximate maximum-likelihood inference in the DCHSBM naturally leads to a clustering objective that generalizes the popular modularity objective for graphs. We derive a general Louvain-type algorithm for this objective, as well as a a faster, specialized"All-Or-Nothing"(AON) variant in which edges are expected to lie fully within clusters. This special case encompasses a recent proposal for modularity in hypergraphs, while also incorporating flexible resolution and edge-size parameters. We show that AON hypergraph Louvain is highly scalable, including as an example an experiment on a synthetic hypergraph of one million nodes. We also demonstrate through synthetic experiments that the detectability regimes for hypergraph community detection differ from methods based on dyadic graph projections. We use our generative model to analyze different patterns of higher-order structure in school contact networks, U.S. congressional bill cosponsorship, U.S. congressional committees, product categories in co-purchasing behavior, and hotel locations from web browsing sessions, finding interpretable higher-order structure. We then study the behavior of our AON hypergraph Louvain algorithm, finding that it is able to recover ground truth clusters in empirical data sets exhibiting the corresponding higher-order structure.

Hypergraphs are a natural modeling paradigm for a wide range of complex relational systems, with nodes representing system components and hyperedges representing multiway interactions. A standard analysis task is to identify clusters of closely related or densely interconnected nodes. In the special case of graphs, in which edges connect exactly two nodes at a time, there are a number of probabilistic generative models and associated inference algorithms for clustering. Many of these are based on variants of the stochastic blockmodel, a random graph with flexible cluster structure. However, there are few models and algorithms for hypergraph clustering. Here, we propose a Poisson degree-corrected hypergraph stochastic blockmodel (DCHSBM), an expressive generative model of clustered hypergraphs with heterogeneous node degrees and edge sizes. Approximate maximum-likelihood inference in the DCHSBM naturally leads to a clustering objective that generalizes the popular modularity objective for graphs. We derive a general Louvain-type algorithm for this objective, as well as a a faster, specialized "All-Or-Nothing" (AON) variant in which edges are expected to lie fully within clusters. This special case encompasses a recent proposal for modularity in hypergraphs, while also incorporating flexible resolution and edge-size parameters. We show that AON hypergraph Louvain is highly scalable, including as an example an experiment on a synthetic hypergraph of one million nodes. We also demonstrate through synthetic experiments that the detectability regimes for hypergraph community detection differ from methods based on dyadic graph projections. In particular, there are regimes in which hypergraph methods can recover planted partitions even though graph based methods necessarily fail. We use our generative model to analyze different patterns of higher-order structure in school contact networks, U.S. congressional bill cosponsorship, U.S. congressional committees, product categories in co-purchasing behavior, and hotel locations from web browsing sessions, finding interpretable higher-order structure. We then study the behavior of our AON hypergraph Louvain algorithm, finding that it is able to recover ground truth clusters in empirical data sets exhibiting the corresponding higher-order structure.

introduction
Graphs are a fundamental abstraction for complex relational systems throughout the sciences [Jackson, 2008;Easley and Kleinberg, 2010;Newman, 2010]. A graph represents components of a system by a set of nodes, and interactions or relationships among these components using edges that connect pairs of nodes. Much of the structure in complex data, however, involves higher-order interactions and relationships between more than two entities at once [Milo, 2002;Benson et al., 2016Benson et al., , 2018Lambiotte et al., 2019;Battiston et al., 2020;Torres et al., 2020]. Hypergraphs are now a burgeoning paradigm for modeling these and many other systems [Li and Milenkovic, 2018;de Arruda et al., 2020;Sahasrabuddhe et al., 2020;Veldt et al., 2020b]. A hypergraph still represents the components by a set of nodes, but the edges (often called hyperedges) may connect arbitrary numbers of nodes. A graph is a special case of a hypergraph, in which each edge connects exactly two nodes.
A well-established graph clustering approach is to model the graph as a sample from a probabilistic generative model , in which case the clustering task can be recast as a statistical inference problem [Nowicki and Snijders, 2001;Hoff et al., 2002;Airoldi et al., 2008;Karrer and Newman, 2011;Yang and Leskovec, 2013;Peixoto, 2014;Athreya et al., 2017]. While generative modeling is a mainstay in graph clustering, generative techniques for hypergraphs are largely lacking. Indeed, while a small number of generative models of clustered hypergraphs have been proposed [Ghoshdastidar and Dukkipati, 2014;Kim et al., 2018;Angelini et al., 2016;Ke et al., 2019], these models typically generate hypergraphs with edges of only one size. With a recent exception [Ke et al., 2019], these models also do not model degree heterogeneity between nodes. Heterogeneity in edge size and node degree are both key features of empirical data [Benson et al., 2018], and their omission limits the applicability of many of these models for practical data analysis. An alternative to generative hypergraph modeling is to transform the hypergraph into a dyadic graph via clique expansion, where a dyadic edge connects any pair of nodes that appear together in some hyperedge [Zhou et al., 2006;Benson et al., 2016]. While this enables the use of a wide array of existing models and algorithms for graphs, the higher-order structure is lost [Chodrow, 2020], and generative models of the resulting dyadic graph may rely on explicitly violated independence assumptions. Recently, nongenerative approaches based on the popular modularity clustering objective for graphs [Newman and Girvan, 2004] have been proposed for hypergraphs [Kumar et al., 2020;Kamiński et al., 2019;Veldt et al., 2020c], although their lack of connection to a generative model limits their interpretability.
Another approach to generative clustering is to use the representation of a hypergraph as a bipartite graph, and apply a generative model (e.g. [Larremore et al., 2014;Gerlach et al., 2018;Yen and Larremore, 2020]) to the latter representation. This approach, while appropriate in many data sets, involves a strong assumption: the memberships of any two nodes in a given hyperedge are independent, conditional on the model parameters. This assumption is natural for certain classes of data. For example, consider an event co-attendance network, with nodes representing music enthusiasts and hyperedges representing concerts. Node membership in a hyperedge corresponds to attendance at the specified event. To a reasonable approximation, the decision of two fans to attend a given concert may indeed be independent, conditioned on the popularity of the performers, the location of the venue, and so on. In other data sets, however, the conditional independence assumption is explicitly violated. Multiway social interaction networks give one important class of examples. Interactions such as gossip, for instance, normally take place only between trusted individuals. The presence of a single uninvited outsider may entirely prevent the interaction from taking place. The "all-or-nothing" structure of such interactions is an important violation of the conditional independence assumptions made by most bipartite generative models. These examples highlight that the task of matching assumptions to higher-order data is an ongoing challenge, for which we benefit from a diversity of distinct tools.
Here, we propose a generative approach to hypergraph clustering based on a degree-corrected hypergraph stochastic blockmodel (DCHSBM). This model generates clustered hypergraphs with heterogeneous degree distributions and hyperedge sizes. We outline an approximate coordinate-ascent maximum-likelihood estimation scheme for fitting this model to hypergraph data, and show that one stage of this scheme generalizes the well-studied modularity objective for graphs. We derive accompanying Louvain algorithms for this class of modularity-like objectives, which are highly scalable in an important special case. We show computationally that hypergraph clustering methods are able to detect planted clusters in regimes in which graph-based methods necessarily fail due to known theoretical limits. We also show that, in data sets with appropriately matched higher-order structure, our generative hypergraph techniques are able to recover clusters correlated to metadata at higher rates than graph-based techniques. Our results highlight the importance of matching generative models to data sets, and point toward a number of directions for further work in higher-order network science.
2 the degree-corrected hypergraph stochastic blockmodel The degree-corrected stochastic blockmodel is a generative model of graphs with both community structure and heterogeneous degree sequences [Karrer and Newman, 2011]. We now extend this model to the case of hypergraphs. For our model, let n be the number of nodes in a hypergraph. Each node i is assigned to one of¯ groups. We let z i ∈ [¯ ] = {1, 2, . . . ,¯ }, 1 denote the 1 Here and elsewhere, we use the notation [r] = {1, . . . , r}.
group assignment of node i, and collect these assignments in a vector z ∈ [¯ ] n . As in the dyadic degree-corrected SBM, each node i is assigned a parameter θ i governing its degree, and we collect these parameters in a vector θ ∈ R n . Let R represent the set of unordered node tuples, so that each R ∈ R is a set of nodes representing the location of a possible hyperedge. (Following the standard choice for the degree-corrected SBM in graphs, we allow R to include node tuples with repeated nodes.) Let z R denote the vector of cluster labels for nodes in a given tuple R, and θ R the vector of degree parameters.
We use an affinity function Ω to control the probability of placing a hyperedge at a given node tuple R, which depends on the group memberships of the nodes in R. Formally, Ω maps the group assignments z R to a nonnegative number.
If Ω(z R ) is large, there is a higher probability that a hyperedge forms between the nodes in R. In our model, the number of hyperedges placed at R ∈ R is distributed as a R ∼ Poisson (b R π(θ R )Ω(z R )), where b R denotes the number of distinct ways to order the nodes of R and π(θ R ) = i∈R θ i is the product of degree parameters. The probability of realizing a given value a R is then This edge generation process has the following intuitive interpretation: for each of the b R possible orderings of nodes in R, we attempt to place a Poisson (π(θ R )Ω(z R )) number of hyperedges on this tuple. The result is a weighted hyperedge on the unordered tuple R, whose weight can be any nonnegative integer. This is a helpful modeling feature, as many empirical hypergraphs contain multiple hyperedges between the same set of nodes. 2 The probability of realizing a given hyperedge 2 Even in hypergraph data sets where we only know the presence or absence of hyperedges (but no weights), the Poisson-based model serves as a computationally convenient approximation to a Bernoulli-based model.
set A = (a R ) R∈R is then just the product of probabilities over each R ∈ R.

ESTIMATION OF DEGREE AND AFFINITY PARAMETERS
There are many methods for inference in stochastic blockmodels and their relatives, including variational coordinate ascent [Airoldi et al., 2008], variational belief propagation [Decelle et al., 2011;Zhang and Moore, 2014], and Markov Chain Monte Carlo [Nowicki and Snijders, 2001;Peixoto, 2020]. We perform approximate maximum-likelihood inference via coordinate ascent. We do so in order to exploit a recent connection between maximum-likelihood inference in stochastic blockmodels and the popular modularity objective for graph clustering [Newman, 2016]. Our coordinate ascent framework, in which we alternate between estimating parameters and node labels, is a close relative of expectation-maximization (EM) algorithms for blockmodel inference [Decelle et al., 2011]. Standard versions of EM construct "soft" clusters, in which each node is given a weighted assignment in every possible cluster. "The" cluster for a given node is often taken to be the cluster in which the node has largest weight. In contrast, our approach generates "hard" clusters in which each node belongs to exactly one cluster. Profile likelihood methods offer an alternative framework for maximum-likelihood inference in SBMs [Bickel and Chen, 2009], and their development for hypergraphs is another promising avenue of future work.
In the maximum-likelihood framework, we learn estimatesẑ of the node labels,Ω of the affinity function, andθ of the degree parameters by solving the optimization problemẑ where A is a given data set represented by a collection of (integer-weighted) hyperedges. As usual, it is easier to work with the log-likelihood, which has the same local optima. The log-likelihood is where The first term Q(z, Ω, θ) is the only part of the log-likelihood that depends on the group assignments z and affinity function Ω. The second term depends on θ, while the third term depends only on the data A and can be disregarded for inferential purposes.
In the coordinate ascent approach to maximum-likelihood, we alternate between two stages. In the first stage, we assume a current estimateẑ and obtain new estimates of Ω and θ by solvinĝ The resultingΩ,θ can be viewed as maximum-likelihood estimates, conditioned on the current estimateẑ of the label vector z. In the second stage, we assume current estimatesΩ andθ and obtain a new estimate of z by solvinĝ We alternate between these two stages until convergence.
There are several identifiability issues that must be addressed. First, permuting the group labels in z and Ω does not alter the value of the likelihood. We therefore impose an arbitrary order on group labels. Second, the number of possible groups can in principle be larger than the number of groups present in z. Such a case would correspond to the presence of groups which are statistically possible but empty in the given data realization. While other treatments are possible, we choose to disregard empty groups and treat¯ as equal to the number of distinct labels in an estimate of z. A final form of unidentifiability relates to the scales of θ and Ω. For a fixed θ and Ω, we can construct θ = θ and Ω = Ω such that L(z, Ω, θ) = L(z, Ω , θ ) (Appendix A). To enforce identifiability, we must therefore place a joint normalization condition on either θ or Ω. We choose to constrain θ such that where vol( ) = n i=1 d i δ(z i , ) and d i is the (weighted) number of hyperedges in which node i appears. 3 The usefulness of (9) is that, when z is known or estimated, the conditional maximum-likelihood estimatesθ andΩ in (7) take simple, closed forms. First, for a fixed label vector z, when using the normalization (9), the maximum-likelihood estimate for θ isθ (See Appendix B.) Second, conditioned on z, if Ω takes constant value ω on some set Y of unordered tuples of labels, 4 the maximum likelihood estimate for ω is 4 In full generality, we can have one such ω for every possible label arrangement for each hyperedge size in the data. Later, we will make natural restrictions on f Omega.
(See Appendix C.) Although (10) assumes that z was fixed, it is not necessary to know z in order to form the estimateθ. However, forming the estimateω via (11) requires that we know or estimate z. It is therefore important to remember thatω is not a globally optimal estimate, but rather a locally optimal estimate conditioned on the currently estimated group labels. The formula (11) could also be inserted directly into the full likelihood maximization problem (2), eliminating the parameters corresponding to Ω and producing a lower-dimensional profile likelihood, which could then in principle be optimized directly. This approach has been successful for dyadic blockmodels [Bickel and Chen, 2009], and the development of similar methods for hypergraph blockmodels would be of considerable interest. The advantage of our coordinate ascent framework is that we are able to develop fast heuristics for solving (8), by generalizing widely-used algorithms for graph clustering (Section 4). Solving problem (7) in our framework can also be interpreted as evaluating the profile likelihood for a fixed cluster vector z, highlighting the relationship between these approaches.
We now turn to the problem of inferring the label vector z. This problem leads naturally to a class of modularity-type objectives for hypergraph clustering.

hypergraph modularities
Our results from the previous section imply that the estimated degree parameter θ and piecewise constant affinity functionΩ can be efficiently estimated in closed form, provided an estimate of z. This provides a solution to the first stage (7) of coordinate ascent. We now discuss the second stage (8). From (3), it suffices to optimize Q with respect to z. To do so, it is helpful to impose some additional structure onΩ.

SYMMETRIC MODULARITIES
We obtain an important class of objective functions by stipulating that Ω is symmetric with respect to permutations of node labels. In this case, Ω(z R ) depends not on the specific labels z R in a given node tuple R, but only on the number of repetitions of each. Statistically, the corresponding DCHSBM generates hypergraphs in which all groups are statistically identical, conditioned on the degrees of their constituent nodes. Symmetric affinity functions thus give a flexible generalization of the planted partition stochastic blockmodel [Jerrum and Sorkin, 1998;Condon and Karp, 2001] to the setting of hypergraphs. Define the function φ(z) = p, where p j is the number of entries of z in the jth largest group in z, with ties broken arbitrarily. For example, if z = (1,1,4,1,2,3,2), then p = (3, 2, 1, 1). We call p a partition vector. The symmetry assumption implies that Ω is a function of z R only through p = φ(z R ). Accordingly, we abuse notation by writing Ω(p) ≡ Ω(z) when p = φ(z).
We now define generalized cuts and volumes corresponding to a possible partition vector p for tuples of k nodes: where R k is the subset of tuples in R consisting of k nodes. The function cut p (z) counts the number of edges that are split by z into the specified partition p, while the function vol p (z) is a sum-product of volumes over all grouping vectors y that induce partition p. Let P be the set of partition vectors on sets up to sizek, the maximum size of hyperedges. We show in Appendix D that the symmetric modularity objective can then be written as For a partition vector p for tuples of k nodes, direct calculation of vol p (z) is a summation of k elements, which can be impractical when either or k are large. We show in Appendix E, however, that it is possible to evaluate these sums efficiently via a combinatorial identity. We also give a formula for updating volume terms vol p (z) when a candidate labeling is modified. The objective function (14) is related to the multiway hypergraph cut problem studied by Veldt et al. [2020a]. They formulate the hypergraph cut objective in terms of splitting functions, which associate penalties when edges are split between two or more clusters. One then aims to minimize the sum of penalties subject to constraints that certain nodes must not lie in the same cluster. Symmetric affinity functions in our framework correspond to signature-based splitting functions in their terminology. Table 1 lists four of many families of affinity functions.
The All-Or-Nothing affinity function distinguishes only whether or not a given edge is contained entirely within a single cluster. This affinity function is especially important for scalable computation, and we discuss it further below. The Group Number affinity depends only on the number of distinct groups represented in an edge, regardless of the number of incident nodes in each one. Special cases of the Group Number affinity arise frequently in applications. When f ( p 0 , k) = exp( p 0 −1), the first term of the modularity objective corresponds to a hyperedge cut penalty that is known in the scientific computing literature as "connectivity − 1" [Deveci et al., 2015], the K − 1 metric [Karypis and Kumar, 2000], or the boundary cut [Hendrickson and Kolda, 2000]. It has also been called fanout in the database literature [Kabiljo et al., 2017]. The related Sum of External Degrees penalty [Karypis and Kumar, 2000] is also a special case of the Group Number affinity. The Relative Plurality affinity considers only the relative difference between the size of the largest group represented in an edge and the next largest group. This rather specialized affinity function is especially appropriate in contexts where groups are expected to be roughly balanced, as we find, for example, in party affiliations in Congressional committees. Finally, the Pairwise affinity counts the number of pairs of nodes within the edge whose clusters differ. While this affinity function uses similar information to that used in dyadic graph models, there is no immediately apparent equivalence between any dyadic random graph and a DCHSBM with the Pairwise affinity function. There are many more symmetric affinity functions; see Table 3 of Veldt et al. [2020a] for several other splitting functions which can be used to define affinities.
An important subtlety was recently raised by Zhang and Peixoto [2020] concerning the relationship between blockmodels and modularity in Newman [2016], which also applies to our derivation of (14) above and (15) below. We derived the conditional maximum-likelihood estimates (10) and (11) of θ and Ω under the assumption of a general, unconstrained affinity function Ω. It is not guaranteed that these same estimates maximize the likelihood when additional constraints-such as the symmetry constraint Ω(z) = Ω(φ(z))-are imposed. Indeed, as Zhang and Peixoto [2020] show for the case of dyadic graphs, (10) and (11) for estimatingθ andΩ are only exact under the symmetry assumption on Ω when vol( ) is constant for each ∈ [ˆ ]. When the sizes of groups vary, as is typical in most data sets, (10) and (11) are instead approximations of the exact conditional maximum-likelihood estimates. The situation is reminiscent of the tendency of the graph modularity objective to generate clusters of approximately equal sizes [Gleich and Mahoney, 2016]. The objectives and algorithms we develop below should therefore be understood as approximations to coordinate-ascent maximum likelihood inference, which are exact only in the case that all clusters have equal volumes. See Zhang and Peixoto [2020] for a detailed discussion of these issues in the context of dyadic graphs.

ALL-OR-NOTHING MODULARITY
All-Or-Nothing affinity function is of special interest for modeling and computation. This affinity function is a natural choice for systems in which the occurrence of an interaction or relationship depends strongly on group homogeneity.
Inserting the All-Or-Nothing affinity function from Table 1 into (14) yields, after some algebra (Appendix F), the objective where , and J(ω) collects terms that do not depend on the partition z. We collect {β k } and {γ k } into vectors β, γ ∈ Rk. We have also defined In this expression, m k is the (weighted) number of hyperedges of size k, i.e., m k = R∈R k a R . The cut terms cut k (z) thus count the number of hyperedges of size k which contain nodes in two or more distinct clusters. This calculation is a direct generalization of that from Newman [2016] for graph modularity. Indeed, we recover the standard dyadic modularity objective by restricting to k = 2. We call (15) the All-Or-Nothing (AON) hypergraph modularity.
[2019] proposed a "strict modularity" objective for hypergraphs. This strict modularity is a special case of (15), obtained by choosing ω k0 and ω k1 such that β k = 1 and γ k = m k (vol(H)) k , where vol(H) = n i=1 d i is the sum of all node degrees in hypergraph H. However, leaving these parameters free lends important flexibility to our proposed AON objective (15). Tuning β allows one to specify which hyperedge sizes are considered to be most relevant for clustering. In email communications, for example, a very large list of recipients may carry minimal information about their social relationships, and it may be desirable to down-weight large hyperedges by tuning β. Tuning γ has the effect of modifying the sizes of clusters favored by the objective, in a direct generalization of the resolution parameter in dyadic modularity [Reichardt and Bornholdt, 2006;Veldt et al., 2018]. Importantly, it is not necessary to specify the values of these parameters a priori ; instead, they can be adaptively estimated via (11).

hypergraph maximum-likelihood louvain
In order to optimize the modularity objectives (14) and (15), we propose a family of agglomerative clustering algorithms. These algorithms greedily improve the specified objective through local updates to the node label vector z. The structure of these algorithms is based on the widely used and highly performant Louvain heuristic for graphs [Blondel et al., 2008]. The standard heuristic alternates between two phases. In the first phase, each node begins in its own singleton cluster. Then, each node i is visited and moved to the cluster of the adjacent node j which maximizes the increase in the objective Q. If no such move increases the objective, then i's label is not changed. This process repeats until no such moves exist which increase the objective. In the second phase, a "supernode" is formed for each label. The supernode represents the set of all nodes sharing that label. Then, the first phase is repeated, generating an updated labeling of supernodes, which are then aggregated in the second phase. The process repeats until no more improvement is possible. Since every step in the first phase improves the objective, the algorithm terminates with a locally optimal cluster vector z.
This heuristic generalizes naturally to the setting of hypergraphs. However, the incorporation of heterogeneous hyperedge sizes and general affinity functions considerably complicates implementation. Here we provide a highly general Hypergraph Maximum-Likelihood Louvain (HMLL) algorithm for optimizing the symmetric modularity objective (14). For the important case of the All-Or-Nothing (AON) affinity, the simplified objective (15) admits a much simpler and faster specialized Louvain algorithm, which we describe in Appendix G. As we show in subsequent experiments, this specialized algorithm is highly scalable and effective in recovering ground truth clusters in data sets with polyadic structure plausibly modeled by the AON affinity.

SYMMETRIC HYPERGRAPH MAXIMUM-LIKELIHOOD LOUVAIN
The first phase of our Symetric HMLL algorithm mirrors standard graph Louvain: nodes begin in singleton clusters and in turn greedily move to adjacent clusters until no more improvement is possible. Phase two of graph Louvain reduces edges between clusters into weighted edges between supernodes in a structure-preserving way. However, in the hypergraph case, naively collapsing clusters and hyperedges would discard important information about hyperedge sizes and the way each hyperedge is partitioned across clusters. Therefore, in subsequent stages of our algorithm, we greedily improve the objective by moving entire sets of nodes in the original hypergraph, rather than greedily moving individual nodes. In this way, a set of nodes that was assigned to the same cluster in a previous iteration is essentially treated as a supernode and moved as a unit, without collapsing the hypergraph and losing needed information about hyperedge structure.
Our overall procedure is formalized in Algorithms 1 and 2. Algorithm 1 visits in turn each set of nodes S c that represents a cluster c from a previous iteration. The algorithm evaluates the change ∆Q in the objective function Q associated with moving all of S c to an adjacent cluster, and then carries out the move that gives the largest positive change to the objective. This is repeated until moving a set S c can no longer improve the objective. The entire Symetric HMLL method is summarized by running Algorithm 2, which starts by placing every node in a singleton cluster before calling Algorithm 1 for the first time.
Algorithm 1 relies on a function ∆Q which computes the change in the objective Q associated with moving all nodes from S c to an adjacent cluster. Changes to the second (volume) term in the objective can be computed efficiently using combinatorial identities (Appendix E, Proposition 1). Changes to the first (cut) term require summing across all hyperedges incident to a node or set of nodes. At each hyperedge, we must evaluate the affinity Ω(p) on the current partition p, as well as the affinity Ω(p ) associated with the candidate updated partition p . This situation contrasts with the case of the graph Louvain algorithm, in which it is sufficient to check whether a given edge joins nodes in the same or different clusters. The fact that we need to store and update the partition vector p for each hyperedge is what prevents us from collapsing a cluster of nodes into a monolithic supernode and recursively applying Algorithm 2 on a reduced data structure, as customary in graph Louvain. Thus, while clusters of nodes move as a unit in Algorithm 1 as well, it is necessary in this case to operate on the full adjacency data A at all stages of the algorithm. This can make Algorithm 2 slow on hypergraphs of even modest size. Developing efficient algorithms for optimizing the general symmetric modularity objective or various special cases is an important avenue of future work.

ALL-OR-NOTHING HYPERGRAPH MAXIMUM-LIKELIHOOD LOUVAIN
When Ω is the All-Or-Nothing affinity function, considerable simplification is possible. For each edge we need not compute the full partition vector p but only check whether or not p 0 = 1. Rather than a general affinity function Ω, we instead supply the parameter vectors β and γ appearing in (15). This allows us to compute on considerably simplified data structures. In particular, we are able to follow the classical Louvain strategy of collapsing clusters into single, consolidated "supernodes," and restrict attention to hyperedges that span multiple supernodes. Because we do not need to track the precise way in which the hyperedges span multiple supernodes, we can forget much of the original adjacency data A and instead simply store the edge sizes of the hypergraph. These simplifications enable both significant memory savings and very rapid evaluation of the objective update function ∆Q. We provide pseudocode for exploiting these simplifications in Appendix G.

NUMBER OF CLUSTERS
Like most Louvain-style modularity methods, the user cannot directly control the number of clusters returned by HMLL. One approach to influence the number of clusters is to manually set values for the affinity function Ω or the parameters β and γ (if using the All-Or-Nothing affinity). Rather than inferring these parameters from data, one can set them a priori and perform a single round of optimization over z. This approach generalizes the common treatment of the resolution parameter in dyadic modularity maximization as a tuning hyperparameter rather than a number to be estimated from data [Reichardt and Bornholdt, 2006]. Considerable experimentation may be required in order to obtain the desired number of clusters, and retrieving an exact number may not be possible.
Another approach to influencing the number of clusters is to impose a Bayesian prior on the community labels. In the simplest version of a Bayesian approach, one assumes that each node is independently assigned one of¯ labels with equal probability, prior to sampling edges. The probability of realizing a given label vector z is then¯ −n , which generates a term of the form −n log¯ in the loglikelihood L. This term may then be incorporated into Louvain implementations, with the result that greedy moves which reduce the total number of clusters¯ are strongly incentivized. The resulting regularized algorithm may then label vectors z with slightly smaller numbers of distinct clusters. This can be useful when it is known a priori that the true number of clusters in the data is small. Our implementation of AON HMLL incorporates this optional regularization term. We use this term only in the synthetic detectability experiments presented in Figure 2. Index, and number of clusters returned by GMLL and HMLL in a synthetic testbed with optimal affinity parameters. The within-cluster edge placement probabilities are p 2 = 3/5, p 3 = 1/n 3 , and p 4 = 1/n 4 . We also show in light gray the results obtained by using GMLL as a preprocessing step, whose output partition is then refined by HMLL (light gray). 5 experiments with synthetic data 5.1 RUNTIME Dyadic Louvain algorithms are known for being highly efficient in large graphs. Here, we show that AON HMLL can achieve similar performance on synthetic data to Graph MLL (GMLL), a variant of the standard dyadic Louvain algorithm in which we return the combination of resolution parameter and partition that yield the highest dyadic likelihood. For a fixed number of nodes n, we consider a DCHSBM-like hypergraph model with¯ = n/200 clusters and m = 10n hyperedges with size k uniformly distributed between 2 and 4. Each k-edge is, with probability p k , placed uniformly at random on any k nodes within the same cluster. Otherwise, with probability 1 − p k , the edge is instead placed uniformly at random on any set of k nodes. We use this model rather than a direct DCHSBM to avoid the computational burden of sampling edges at each k-tuple of nodes, which is prohibitive when n is large. For the purpose of performance testing, we compute estimates of the parameter vectors β and γ (in the case of AON HMLL) and the resolution parameter γ (in the case of GMLL) using ground truth cluster labels. We emphasize that this is typically not possible in practical applications, since the ground truth labels are not known. We make this choice in order to focus on a direct comparison of runtimes of each algorithm in a situation in which both can succeed. In later sections, we study the ability of HMLL and GMLL to recover known groups in synthetic and empirical data when affinities and resolution parameters are not known. Figure 1 shows runtime, Adjusted Rand (ARI) and number of clusters returned on synthetic hypergraphs when p 2 = 0.6, p 3 = 1/n 3 and p 4 = 1/n 4 . These parameter are chosen so that hyperedges of size three and four are rarely (if ever) contained completely inside clusters. Thus, hyperedges of different sizes provide different signal regarding ground truth clusters. For this experiment, we implemented Graph MLL by computing a normalized clique projection, in which nodes are joined by weighted dyadic edges with weights We also performed experiments on an unnormalized clique projection with w ij = |e : i, j ∈ e|, but do not show these results because, in this experiment, the associated MLL algorithm consistently fails to recover labels correlated with the planted clusters.
On smaller instances, HMLL outperforms Graph MLL in recovering planted clusters, as measured by the ARI. For larger instances, the recovery results are comparable. Interestingly, although GMLL and HMLL obtain similar accuracy in this experiment, they do so in different ways, with HMLL tending to generate more, smaller clusters than GMLL. Importantly, the runtimes are nearly indis-tinguishable, indicating that dyadic clique projections are necessary neither for accuracy nor for performance. We observed other choices of the parameters p 2 , p 3 , and p 4 in which HMLL substantially outperformed GMLL in cluster recovery and vice versa; however, in each case the algorithms' respective runtimes tended to differ by only a small constant factor.
Interestingly, in this synthetic experiment, a combination of the two algorithms leads to the strongest recovery results. In addition to running each algorithm independently, we also ran a two-stage algorithm in which GMLL is used to generate an intermediate partition, and then HMLL is used to refine it. This procedure is similar to the warmstart approach from Kamiński et al. [2020]. We emphasize again that these results are obtained on synthetic hypergraphs with preoptimized affinity parameters, and so the effectiveness of the refinement strategy may not generalize to real data sets. Indeed, in the experiments on empirical data shown in Section 6, we do not show results from the refinement procedure because the output partition was in each case essentially indistinguishable from the output of the dyadic partition. This may reflect the fact that we did not allow the algorithms to learn a priori the affinity parameters associated with the true data labels. Further investigation into the performance of hybrid strategies would be of considerable practical importance.

DYADIC PROJECTIONS AND THE DETECTABILITY THRESHOLD
Informally, an algorithm is able to detect communities in a random graph model with fixed labels z when the output labelingẑ of that algorithm is, with probability bounded above zero, correlated with z. Using arguments from statistical physics, Decelle et al. [2011] conjectured the existence of a regime in the graph stochastic blockmodel in which no algorithm can successfully detect communities. This conjecture has since been refined and proven in various special cases; see Abbe [2017] for a survey. In the dyadic stochastic blockmodel with two equal-sized planted communities, a necessary condition for detectability in the large-graph limit is where c i is the mean number of within-cluster edges attached to a node, and c o is the mean number of between-cluster edges attached to a node. If this condition is not satisfied, no algorithm can reliably detect communities in the associated graph stochastic blockmodel, even though the communities are statistically distinct. This bound limits direct inferential methods, such as Bayesian or maximumlikelihood techniques, as well as methods based on maximization of modularity or other graph objectives [Nadakuditi and Newman, 2012]. Several recent papers have considered the detectability problem in the case of uniform hypergraphs Dukkipati, 2014, 2017;Angelini et al., 2016]. In our model, the presence of edges of multiple sizes complicates analysis. Here, we limit ourselves to an experimental demonstration that the regimes of detectability for the graph SBM and our DCHSBM can differ significantly. Figure 2 shows two experiments on a simple DCHSBM with two equal-sized communities of 250 nodes each. The affinity Ω is tuned so that: 1. Each node is incident to, on average, five 2-edges and five 3-edges.
2. A fraction p 2 of 2-edges join nodes in the same cluster, while a fraction 1 − p 2 of 2-edges join nodes in different clusters.
3. A fraction p 3 of 3-edges join nodes in the same cluster, while a fraction 1 − p 3 of 3-edges join nodes in different clusters.
In this experiment only, both GMLL and AON HMLL discussed below use the Bayesian regularization term −n log¯ in the likelihood objective in order to encourage each algorithm to form a relatively small number of clusters. where each node is incident to, on average, five 2-edges and five 3-edges.
(Left) The recovered partitionẑ is obtained from GMLL. (Right) The recovered partition is obtained from AON HMLL (Algorithm 3). The dashed and dotted lines give various detectability thresholds as described in the main text. In each panel, the returned partitionẑ is highest-likelihood partition from 20 alternations between updatingẑ and inference of the affinity parameters.
In this experiment only, we incorporate a regularization term −n log¯ in the modularity objective in order to promote label vectors z with fewer clusters.
In the lefthand panel, we show the Adjusted Rand Index (ARI) of the returned partitionẑ against the true partition z when using the unnormalized variant of GMLL (results for the normalized variant are similar). This choice reflects the fact that the true number of clusters is known and is equal to 2. The dashed and dotted white lines give the boundaries at which (18) holds with equality. The dashed white line gives the detectability threshold for the assortative regime in which nodes are more likely to link with others in the same cluster. Louvain, as an agglomerative algorithm, is well-suited for detecting assortative clusterings, and is able to detect communities in much, but not all, of this regime. The gap between the theoretical threshold and the performance of Louvain reflects the fact that Louvain, as a stagewise greedy algorithm, possesses no optimality guarantees. There is also a disassortative detectable region below the dotted white line. The agglomerative structure of graph-based Louvain causes the algorithm to entirely fail here.
In the righthand panel, we show the same experiment using AON HMLL. The dashed lines q 2 and q 3 give assortative detectability thresholds for hypothetical algorithms which entirely ignore 3-edges and 2-edges, respectively, while the dotted lines r 2 and r 3 give the corresponding disassortative thresholds. HMLL is able to detect the planted partition for a range of parameter values in which GMLL is not. These include the case in which edges of certain sizes are largely between-cluster, as shown in the top left (small p 2 ) and bottom right (small p 3 ). There is a regime (mid and bottom right) in which the algorithm appears to be constrained by the boundary q 2 , suggesting that HMLL is effectively ignoring 3-edges in this regime. As p 3 increases, however, 3-edges become more informative and the partition can be detected for some values p 2 < q 2 (top right). There is also a broad regime (top left) in which the hypergraph algorithm is able effectively to use both 2-and 3-edges to detect clusters, even when 2-edges are largely between-cluster. We also observe some very limited ability of HMLL to detect clusters in the regime in which both 2-edges and 3-edges are between-cluster (bottom left). Since HMLL is again an agglomerative algorithm, its performance for fully disassortative partitions such as these is unreliable at best.
Intriguingly, there are also combinations of p 2 and p 3 in which GMLL is able to detect the planted partition while HMLL is not. This may indicate that the pooling of edges of different sizes implied by the dyadic projection can be useful in some regimes. We note again that neither GMLL or HMLL are optimal inference algorithms. An optimal hypergraph algorithm might significantly extend the detectable regime in the right panel of Figure 2. We pose the development of such algorithms, as well as their analysis, as highly promising avenues for future research. Next, we analyze several hypergraphs derived from empirical data. The first two are hypergraphs of human close-proximity contact interactions [Benson et al., 2018], obtained from wearable sensor data at a primary school [Stehlé et al., 2011] and a high school [Mastrandrea et al., 2015]. Nodes are students or teachers, and a hyperedge connects groups of people that were all jointly in proximity to one another. Node labels identify the classrooms to which each student belongs, and the primary school data also includes a teacher associated to each class. Next, we created two hypergraphs from U.S. Congressional bill cosponsorship data [Fowler, 2006a,b], where nodes correspond to congresspersons, and hyperedges correspond to the sponsor and all cosponsors of a bill in either the House of Representatives or the Senate. We constructed another pair of data sets from the U.S. Congress in the form of committee memberships [Stewart and Woon, 2021]. Each edge is a committee in a meeting of Congress, and each node again corresponds to a member of the House or a senator. A node is contained in an edge if the corresponding legislator was a member of the committee during the specified meeting of Congress. The 103rd through 115th Congresses are represented, spanning the years 1993-2017. There are again separate data sets for House and Senate members. In all congressional data sets, the node labels give the political parties of the members. We also used a hypergraph of Walmart purchases [Amburg et al., 2020], where each node is a product and a hyperedge connects a set of products that were copurchased by a customer in a single shopping trip. Each node has an associated product category label. Finally, we constructed a hypergraph where nodes correspond to hotels listed at trivago.com, and each hyperedge corresponds to a set of hotels whose web site was clicked on by a user of Trivago within a browsing session. This hypergraph was derived from data released for the 2019 ACM RecSys Challenge contest. 5 For each hotel, the node label gives the country in 5 https://recsys.acm.org/recsys19/ challenge/ which it is located. The data sets vary in size in terms of the number of nodes, hyperedges, hyperedge sizes, and node labels ( Table 2).

MODEL COMPARISON AND HIGHER-ORDER STRUCTURE
It is often stated that higher-order features are important for understanding the structure and function of complex networks. It is less often clarified what kinds of higher-order features are relevant for which networks. Generative modeling provides one way to compare different kinds of higher-order structure. In the DCHSBM, this structure is specified by the affinity function Ω. Comparison of the likelihood functions obtained by each affinity can indicate which one is most plausible as a higher-order generative mechanism of the underlying data. We performed such a comparison using the symmetric affinity functions from Table 1 and the labels for nodes described above. In this setup, we can compute an approximate ML estimate for Ω, given its functional form. In order to make concrete comparisons, it is necessary to specify the functional forms of the Group Number (GN), Relative Plurality (RP), and Pairwise (P) affinities. We use the following parameterizations: The Group Number affinity function assigns a separate parameter to each combination of edge size and number of groups. The Relative Plurality affinity function assigns one parameter for the case that the difference between the largest and second largest groups within an edge exceeds k/4, where k is the size of the edge. The Pairwise affinity function assigns one parameter to the case that the total number of dyadic pairs in differing groups exceeds half the possible number of such pairs. RP, which favors edges which the two most common labels are roughly balanced in representation, is qualitatively distinct from AON, GN, and P, all of which favor edges with homogeneous cluster labels. Because these affinity functions possess different numbers of parameters, we compare them via the Bayesian Information Criterion (BIC) [Schwarz, 1978], which penalizes affinity functions with more parameters than are supported by the data. In computing the BIC, we exclude the n parameters θ, as these are the same in each model and therefore contribute an unimportant additive constant. The AON, RP, and P affinities each have 2k parameters. In the case of GN, we compute the number of possible parameters for each edge size k by computing the number of possible groups using the number of distinct labels in the given partition. For example, if the given partition contained only three distinct groups, then we do not posit parameters corresponding to edges containing more than three groups. It would also be reasonable to remove this restriction, in which case there would be k parameters for edges of size k regardless of z. Table 3 shows the BIC for the DCHSBM using each of these affinity functions. Importantly, no single affinity function is preferred across all of the study data sets, suggesting the presence of different kinds of polyadic structure. In the two congressional committee data sets, RP achieves the optimal BIC, while in each of the other data sets, one of the three affinities that promotes edge homogeneity is instead preferred. There are also important differences between these three affinities. In house-bills, the Pairwise affinity function achieves the lowest BIC overall, while in walmart-purchases the Pairwise affinity is preferred over all but the Group Number affinity. This suggests that a model involving only pairwise comparison of node labels can provide relatively strong generative explanations of the data in these cases. This in turn suggests that dyadic algorithms may perform at least as well on these data sets as their polyadic counterparts. Indeed, as we will see below, in both of these data sets, dyadic algorithms can return clusterings more correlated with ground truth than those returned by AON HMLL.

RECOVERING CLASSES IN CONTACT HYPERGRAPHS
To test the AON HMLL algorithm itself, we first study its behavior in the contact-primary-school and contact-high-school networks. The comparison of BIC scores from Table 3 suggests that GN may be the most explanatory model of the data, but we instead use AON in order to take advantage of its considerable computational benefits. We performed 20 alternations between AON HMLL and estimation of the AON parameters, and returned the partition with the highest DCHSBM likelihood. We compare the results to two dyadic methods. Each step of the Graph Louvain algorithm alternates between using the standard Louvain    Table 1. Lower BIC indicates a more plausible model. The affinity function achieving the lowest BIC in each data set is shown in bold.
algorithm [Blondel et al., 2008] to infer clusters and estimating the resolution parameter γ using the approximate maximum-likelihood framework of [Newman, 2016]. Graph Louvain returns the partition which maximizes the classical dyadic modularity objective. We also compare to Graph Maximum-Likelihood Louvain (GMLL), which carries out the same alternation, but instead returns the partition that maximizes the approximate log-likelihood of the corresponding planted partition stochastic blockmodel. Figure 3 compares the performance of each of these algorithms. In the case of contact-primary-school, we consider the ground truth partition to be the one that assigns exactly one teacher to each class. Graph Louvain is able to find partitions of students with clear correlations with the given class labels, but conflates two primary school classes and splits several high school classes (left column, top two rows). Graph MLL is able to perfectly recover the primary school student class labels, and misclassifies three high school students. Our proposed AON HMLL is able to correctly recover the given partitions in both data sets.
We can obtain some qualitative insight into the behavior of HMLL by studying the structure of the inferred affinity function Ω. The most intuitive way to do so is through the derived parameters β k and γ k from eq. (15). The bottom row of Figure 3 shows these parameters, as well as the distribution of edge sizes. The dependence of β k on edge size k provides one explanation of why Graph MLL succeeds in contact-primary-school but makes several errors in contact-high-school. Under the standard dyadic projection, a k-hyperedge generates k 2 2-edges, and therefore appears in the dyadic modularity objective k 2 distinct times. In the case of contact-primary-school, the estimated importance parameter β k is indeed relatively close to k 2 (bottom center panel of Figure 3). At the optimal partition, the relative weights of edges are therefore distorted relatively little by the clique projection. On the other hand, the estimates for β k in contact-high-school deviate considerably from k 2 , especially for k = 2, 3. Here, small edges feature much more prominently in the polyadic modularity objective than they do in the projected dyadic objective, implying that the latter is a poorer approximation to the former near the optimal partition. This difference may explain the small errors in Graph MLL in contact-high-school. The bottom-right panel of Figure 3 compares the inferred value of the size-specific resolution parameter γ k to γ k0 = m k /vol(H) k , the implicit value used in [Kamiński et al., 2019]. The inferred resolution parameters are consistently larger γ k0 and increase with k, highlighting the value of adaptively estimating these parameters in our approach.

CLUSTER RECOVERY WITH LARGE HYPEREDGES
In Figure 4, we study the ability of AON HMLL to recover ground truth communities in several more of our study data sets. Unlike the two contact networks, FIGURE 3 -Comparison of clustering algorithms in contact-primary-school and contact-high-school. For each data set, we show a partition obtained from the classical graph Louvain modularity maximization heuristic; a partition obtained from Graph Maximum-Likelihood Louvain (GMLL); and partition obtained by AON HMLL. The partition shown is the one which attains the corresponding objective function after 20 rounds of iterative likelihood maximization. Each box records the number of agents with the specified combination of inferred cluster and ground truth label. The bottom row visualizes the number m k of edges of size k, the inferred size weights β k , and inferred resolution parameters γ k as defined in (15). On the far right, γ k0 = m k /vol(H) k . each of these data sets contains edges of size up to 25 nodes. We have excluded house-committees and senate-committees on the grounds that these data sets are disassortative, indicating that AON is clearly inappropriate. We compare AON HMLL to two variants of GMLL. In the unnormalized variant, we obtain a dyadic graph by replacing each k-edge with a k-clique, thus generating a total of k 2 dyadic edges. In the normalized variant, we weight each edge in the k-clique by a factor of 1 k−1 . The normalized dyadic degree of each node is then equal to its degree in the original hypergraph. In either case, we then alternate between the dyadic Louvain algorithm for estimating clusters and conditional maximum likelihood inference of the resolution parameter γ. In each trial, we perform 20 iterations of AON HMLL as well as the two GMLL variants, returning from these the combination of group labels and parameters that achieves the highest likelihood. We then compare the clustering to the ground truth labels via the Adjusted Rand Index. We vary the maximum edge sizek in order to show how each algorithm responds to the incorporation of progressively larger edges. Because extreme sparsity poses issues for community detection algorithms in general [Abbe, 2017], we show experiments for progressively denser cores of trivago-clicks and walmart-purchases. The c-core of of a hypergraph H is defined as the largest sub-hypergraph H c such that all nodes in H c have degree at least c. The results highlight the strong dependence of the performance of AON HMLL on the relative plausibility of the AON affinity function as a generative mechanism for the data (cf. Table 3). In trivago-clicks, the AON affinity function achieved the lowest BIC of all four candidates. Because AON is a more plausible generating mechanism by this metric, it is not surprising that AON HMLL is able to find partitions considerably more correlated with the supplied data labels than those returned by the dyadic variants. In walmart-purchases, on the other hand, the Pairwise affinity is preferred to AON. In this case, AON HMLL performs much worse, and in the 2-core even returns clusters which are anticorrelated with the supplied labels. As weakly-connected nodes are removed and the resulting data becomes denser, HMLL begins to return correlated clusters. However, the normalized GMLL variant is at least as effective in recovering the data labels.
In the two Congressional bills data sets, the Pairwise affinity achieves a lower BIC than AON in the House and a comparable one in the Senate. Echoing this finding, a dyadic method outperforms AON HMLL in each of these cases. Interestingly, unnormalized GMLL performs best in house-bills and senate-bills, while normalized GMLL is preferable in walmart-purchases. In addition, HMLL is the worst algorithm only in the case of the 2-core of walmart-purchases for small k. HMLL may therefore be the algorithm of choice in cases when it is not known whether normalized or unnormalized dyadic representations are more appropriate for the data.
When interpreting these recovery results, it is important to contextualize them against the limitations of community detection methods in general and of modularity maximization in particular. There is no "best algorithm" for community detection that does not make implicit assumptions about the structure of the data, and mismatch of algorithms to data sets can generate misleading results [Peel et al., 2017]. 6 Even when the data generating process indeed matches 6 That being said, a test similar to BESTest [Peel et al., 2017] reveals that the likelihood under DCHSBM is much greater than the likelihood under random label permutations in most data set configurations, implying significant correlation between network structure and the labels.
algorithmic assumptions-such as a synthetic data set generating from a stochastic blockmodel-optimal algorithms may fail to detect planted communities due to sparsity [Decelle et al., 2011;Abbe, 2017]. Greedy modularity maximization, including the Louvain variants considered here, only finds one of possibly many local optima [Good et al., 2010], some of which may be largely uncorrelated with each other. These considerations imply that (a) we cannot rule out the existence of other local optima which might achieve higher scores in any of the three algorithms and (b) the fact that an algorithm fails to recover a clustering close to the ground truth does not imply that it is "failing" in its stated objective, namely, local likelihood maximization. Overall, our results suggest that, when the assumptions of the DCHSBM with All-Or-Nothing affinity are appropriate to the data, AON HMLL can outperform dyadic approaches in recovering ground truth communities. In practice, since we often do not have access to ground truth labels, the question of whether or not the assumptions are appropriate to the data should be informed by domain expertise.

discussion
We have proposed a generative approach for clustering polyadic data, grounded in a degree-corrected hypergraph stochastic blockmodel (DCHSBM). From this model we have derived a symmetric, modularity-like objective, which includes the All-Or-Nothing (AON) modularity objective as an important special case. This derivation connects hypergraph modularity objectives to concrete modeling assumptions, which can be tuned in response to domain expertise. We have also formulated Louvain-like algorithms for optimizing these objectives, which are highly scalable in the case of the AON affinity function. Embedding this heuristic within an alternating approximate maximum likelihood scheme allows adaptive estimation of both node clusters and affinity parameters. We have shown experimentally that hypergraph algorithms possess markedly different detectability regimes from dyadic algorithms. We have also conducted experiments on empirical data, finding that hypergraph methods are preferred to dyadic ones in data sets where their modeling assumptions are well-founded.
Our work points toward many directions of further research. One of these directions is algorithmic. Our greedy coordinate-ascent framework for inference in the DCHSBM has several important limitations. First, because we rely on an NP-hard optimization step, global maximization of the likelihood is never assured. Second, even exact maximum-likelihood itself is limited as an inference paradigm, as it uses information contained only within a small part of the likelihood landscape. Our method, as an approximation which is exact only when clusters are of roughly equal sizes, may also suffer from estimation bias. Third, the edgewise agglomerative approach embodied by Louvain-style algorithms is limited in applicability to affinity functions that promote homogeneity within edges. Alternative inference paradigms may ameliorate some or all of these limitations. Within the framework of maximum-likelihood inference, directly maximizing a profile likelihood offers an intriguing alternative to coordinate ascent [Bickel and Chen, 2009]. While all maximum-likelihood methods are equivalent insofar as they optimize the same objective function, algorithmic properties such as runtime and propensity to be trapped in undesirable local optima may vary between different approaches. Fully Bayesian treatments [Peixoto, 2019] offer another promising path, although these are sometimes limited in their computational scalability. Variational belief-propagation [Decelle et al., 2011;Zhang and Moore, 2014] provides an intriguing compromise, achieving considerable scalability in exchange for several approximations. Recent work [Angelini et al., 2016] has made progress in this direction, but several questions related to scalability and behavior in nonuniform hypergraphs remain extant. Belief-propagation methods for scalable inference with more general affinity functions would be of particular practical interest.
There are also several important directions of theoretical development. One of these is the question of detectability in the DCHSBM. Because the DCHSBM is more flexible than the dyadic DCSBM, the theory of detectability in this model may be substantially more complex. Another direction concerns the properties of the dyadic modularity objective that extend to the hypergraph modularity objectives discussed here. In addition to its role as a comparison against null models [Newman, 2006] and as a term in the DCSBM likelihood [Newman, 2016], the dyadic modularity also expresses the stability of diffusion processes on graphs [Delvenne et al., 2010] and the energy of discrete surface tensions defined on graphs [Boyd et al., 2019]. Extensions of these properties, or explanations of why they fail to generalize, would be helpful for both theorists and practitioners. S U P P L E M E N TA RY I N F O R M AT I O N Generative Hypergraph Clustering: From Blockmodels to Modularity a unidentifiability in θ and Ω Fix θ and Ω, and fix ∈ [¯ ] and s ∈ R. If we do not provide any additional normalization condition, we can construct a modified vector θ given by Define the modified affinity function is the number of times that label appears in z R . By construction, L(z, Ω, θ) = L(z, Ω , θ ). Thus, we are unable to distinguish a single choice of θ and Ω from the infinite family of possibilities obtained by varying s. The normalization (9) is used to select a single choice of parameters from this family.
b maximum-likelihood estimation of θ Recall that we choose to constrain θ such that where vol( ) = n i=1 d i δ(z i , ), with d i representing the (weighted) number of hyperedges in which node i appears. Letting k i (R) be the number of times nodes i show up in tuple R, we can write In order to prove that the maximum likelihood estimate of θ isθ = d (the formula in Eq. (10)), we will prove the following identity This implies that the second term of Q does not depend on θ once the normalization (9) has been enforced. Since the first term of Q is also independent of θ, we can compute the estimateθ by maximizing K(θ) alone. We write The first-order optimality conditions on θ from the objective (22) with constraint (20) imply that K(θ) is maximized with respect to θ i when we set Thus, the maximum likelihood estimate of θ iŝ θ = d as long as identity (21) holds.
We prove identity (21) by direct calculation, after first introducing some useful notation. Recall that R continues to represent sets of unordered node tuples from a set of n nodes. We will use T to represent the set of ordered node tuples. For a hyperedge size k, let T k specifically refer to the set of ordered sets of k nodes, where again we have allowed repeated nodes. For every T ∈ T , z T is the set of cluster labels for nodes in T , and θ T is a vector of degree parameters. Recall that b R represents the number of ways we can order elements from a set R ∈ R. This is given formally by k k1(R),...,kn(R) , where k i (R) is the number of times that node i appears in R. Thus, every R ∈ R is associated with b R different tuples in T . We then have as was to be shown.
c maximum-likelihood estimation of Ω We now provide details for the maximum-likelihood estimate of ω in Eq. (11). Inserting (21) into the definition of Q yields Focusing on the set Y on which Ω takes constant value ω, we write Q(z, Ω, θ) = y∈Y R∈R δ(z R , y)a R log ω − ω y∈y vol(y) + C , where C captures terms that do not depend on ω. The first-order condition reads Solving for the constant ω yields (11). d derivation of (14) We have where to obtain the third line we have applied the identity (21).
e fast evaluation of volume terms The number U p is an order-corrected version of the moments vol p (z) in which there is a single term for each distinct partition vector p. Let e j be the jth standard basis vector. For each k, let µ k = ¯ =1 vol( ) k ; these moments can be computed in O(nk) time.
Proof. For notational compactness, let v = vol( ). Let r = p 0 be the number of nonzero elements in p. By definition, we have For a given label vector y such that φ(y) = p, there are k p k j=1 (|{h : p h = j}|!) −1 ways to permute the labels of y. The number U p thus has exactly one term of the form k y∈y v y for each equivalence class of cluster label vectors under permutations of indices. We thus obtain the simplification Peeling off the rth term in this product gives We first recognize t1 =··· =tr−1 We next observe that Inserting these identities into (40) completes the proof.
We note that Proposition 1 also gives an efficient recursive update for the numbers U p when moving nodes between clusters. Let z and z be distinct label vectors, and let {µ k } and {µ k } be the moments of volumes evaluated with respect to z and z respectively. Then, applying Proposition 1 twice yields the update ∆U p = U p − U p : ∆U p = (∆µ pr )U p−prer + µ pr (∆U p−prer ) + (∆µ pr )(∆U p−prer ) − r−1 j=1 ∆U p+pr(ej −er) .
We can thus use Proposition 1 to compute the set of numbers {U p } once "from scratch," and then efficiently update these numbers as the label vector changes in the course of optimization.
In order to later simplify terms in the modularity objective, first recall that R k is the set of unordered k-tuples of nodes, and T k is the set of ordered k-tuples.
Let T ⊆ indicate that all nodes from an ordered k-tuple T ∈ T k are in cluster .
Recall that π(v) is the product of entries of a vector v. Then we have Using the fact that θ R = d R , and applying definition of the cut function cut k (z), we derive the simplified form of the all-or-nothing modularity objective: where g all-or-nothing hypergraph maximum-likelihood louvain In Section 4.1, we gave pseudocode for Hypergraph Maximum-Likelihood Louvain (HMLL) for general symmetric affinity functions Ω (Algorithms 1 and 2). For the special case of the All-Or-Nothing affinity function, it is possible to obtain a considerably faster Louvain algorithm. In particular, it is possible to follow the "supernode" strategy of classical dyadic Louvain, enabling to compute on data structures much simpler than the original adjacency hyperarray. Algorithm 3 describes the outer loop of AON HMLL. The overall structure is much the same as in Algorithm 2. In each iteration of the outer loop, we collapse an initial clustering z into a reduced hypergraphH with the function Collapse(H, z), in a way that still preserves the special AON structure. We run a Louvain step on the reduced hypergraph, and then use a function Expand(H, z,z) to translate the clusteringz on the reduced hypergraph into a clustering z on the original hypergraph. If z and z coincide, the Louvain step found no improvement and we terminate. Otherwise, we begin a new iteration, which begins by collapsing the updated clustering to a reduced hypergraph before running the Louvain step.
In constructing the reduced representationH, it is necessary to store only the collapsed hypergraph alongside a vectors of original sizes of each edge. The degreed of each collapsed node is simply the sum of degrees of nodes in the corresponding cluster . The Louvain step itself is given by Algorithm 4. At each stage, we examine a collapsed node i and compute the change in objective associated with changingz toz i →A , wherez is the collapsed label vector obtained by setting z i → A while leaving other entries unchanged. This calculation is contained within the subroutine ∆Q AON