Neural approximation of Wasserstein distance via a universal architecture for symmetric and factorwise group invariant functions

Samantha Chen¹ and Yusu Wang¹,²

¹Department of Computer Science and Engineering, University of California, San Diego
²Halıcıoğlu Data Science Institute, University of California, San Diego

November 20, 2023
Abstract
Learning distance functions between complex objects, such as the Wasserstein dis-
tance to compare point sets, is a common goal in machine learning applications. How-
ever, functions on such complex objects (e.g., point sets and graphs) are often required
to be invariant to a wide variety of group actions e.g. permutation or rigid trans-
formation. Therefore, continuous and symmetric product functions (such as distance
functions) on such complex objects must also be invariant to the product of such group
actions. We call these functions symmetric and factor-wise group invariant functions
(or SFGI functions in short). In this paper, we first present a general neural network
architecture for approximating SFGI functions. The main contribution of this paper
combines this general neural network with a sketching idea to develop a specific and
efficient neural network which can approximate the p-th Wasserstein distance between
point sets. Very importantly, the required model complexity is independent of the sizes
of input point sets. On the theoretical front, to the best of our knowledge, this is the
first result showing that there exists a neural network with the capacity to approximate
Wasserstein distance with bounded model complexity. Our work provides an interest-
ing integration of sketching ideas for geometric problems with universal approximation
of symmetric functions. On the empirical front, we present a range of results showing
that our newly proposed neural network architecture performs comparably to or better
than other models (including a SOTA Siamese autoencoder based approach). In par-
ticular, our neural network generalizes significantly better and trains much faster than
the SOTA Siamese AE. Finally, this line of investigation could be useful in exploring
effective neural network design for solving a broad range of geometric optimization
problems (e.g., k-means in a metric space).
1 Introduction
Recently, significant interest in geometric deep learning has led to a focus on neural network
architectures which learn functions on complex objects such as point clouds (Zaheer et al., 2017; Qi et al., 2017) and graphs (Scarselli et al., 2008). Advancements in the development of neural networks for complex objects have led to progress in a variety of applications, from 3D image
segmentation (Uy et al., 2019) to drug discovery (Bongini et al., 2021; Gilmer et al., 2017).
One challenge in learning functions over such complex objects is that the desired functions
are often required to be invariant to certain group actions. For instance, functions on point
clouds are often permutation invariant with respect to the ordering of individual points.
Indeed, developing permutation invariant or equivariant neural network architectures, as
well as understanding their universal approximation properties, has attracted significant
attention in the past few years; see e.g., (Qi et al., 2017; Zaheer et al., 2017; Maron et al.,
2018, 2019; Segol and Lipman, 2019; Keriven and Peyré, 2019; Yarotsky, 2022; Wagstaff et al., 2019, 2022; Bueno and Hylton, 2021).
However, in many geometric or graph optimization problems, our input goes beyond a single complex object to multiple complex objects. For example, the p-Wasserstein distance $W_p(X, Y)$ between two point sets $X$ and $Y$ sampled from some metric space (e.g., $\mathbb{R}^d$) is a function over pairs of point sets. To give another example, the 1-median of a collection of $k$ point sets $P_1, \ldots, P_k$ in $\mathbb{R}^d$ can be viewed as a function over $k$ point sets. A natural way to model such functions is to use a product space. In particular, let $\mathcal{X}$ denote the space of finite point sets from a bounded region in $\mathbb{R}^d$. Then the p-Wasserstein distance can be viewed as a function $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Similarly, the 1-median for $k$ point sets can be modeled by a function from the product of $k$ copies of $\mathcal{X}$ to $\mathbb{R}$. Such functions are not only invariant to permutations of the factors of the product space (i.e., $W_p(X, Y) = W_p(Y, X)$) but are also invariant or equivariant with respect to certain group actions for each factor. For the p-Wasserstein distance, $W_p$ is invariant to permutations of the ordering of points within both $X$ and $Y$. This motivates us to extend the setting of learning continuous group invariant functions to learning continuous functions over product spaces which are both invariant to some product of group actions and symmetric. More precisely, we consider a type of function which we call an SFGI product function.
Definition 1.1 (SFGI product function). Given compact metric spaces $(\mathcal{X}_i, d_{\mathcal{X}_i})$ for $i \in [k]$, we define symmetric and factor-wise group invariant (SFGI) product functions as functions $f : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_k \to \mathbb{R}$ such that $f$ is (1) symmetric over the $k$ factors, and (2) invariant to the group action $G_1 \times G_2 \times \cdots \times G_k$ for some group $G_i$ acting on $\mathcal{X}_i$, for each $i \in [k]$.
Contributions. SFGI product functions can represent a wide array of geometric matching
problems including computing the Wasserstein or Hausdorff distance between point sets. In
this paper, we first provide a general framework for approximating SFGI product functions
in Section 3.1. Our primary contribution, described in Section 3.2, is the integration of
this general framework with a sketching idea in order to develop an efficient and specific
SFGI neural network which can approximate the p-Wasserstein distance between point sets
(sampled from a compact set in a nice metric space, such as the fixed-dimensional Euclidean
space). Most importantly, the complexity of our neural network (i.e., the number of parameters needed) is independent of the maximum size of the input point sets, and depends only on
the additive approximation error. To the best of our knowledge, this is the first architecture
which provably achieves this property. This is in contrast to many existing universal approx-
imation results on graphs or sets (e.g., for DeepSet) where the network sizes depend on the
size of each graph or point set in order to achieve universality (Maron et al., 2019; Wagstaff
et al., 2022; Bueno and Hylton, 2021). We also provide a range of experimental results in
Section 4 showing the utility of our neural network architecture for approximating Wasser-
stein distances. We compare our network with a SOTA Siamese autoencoder (Kawano et al., 2020), a natural Siamese DeepSets network, and the standard Sinkhorn approximation of Wasserstein distance. Our results show that our universal neural network architecture produces Wasserstein approximations which are better than those of the Siamese DeepSets network, comparable to those of the SOTA Siamese autoencoder, and generalize much better than both to input point sets of sizes which are unseen at training time. Furthermore, we show that our approximation (at inference time) is much faster than the standard Sinkhorn approximation of the p-Wasserstein distance at similar error thresholds. Moreover, our neural
network trains much faster than the SOTA Siamese autoencoder. Overall, our network is
able to achieve equally accurate or better Wasserstein approximations which generalize
better to point sets of unseen size as compared to SOTA while significantly reducing
training time. In Appendix B, we provide a discussion of issues with other natural choices of
neural network architectures one might use for estimating Wasserstein distances, including
one directly based on Siamese networks, which are often used for metric learning.
All missing proofs / details can be found in the Supplementary materials.
Related work. Efficient approximations of Wasserstein distance via neural networks are
an active area of research. One approach is to use input convex neural networks to approxi-
mate the 2-Wasserstein distance (Makkuva et al., 2020; Taghvaei and Jalali, 2019). However,
for this approach, training is done per pair of inputs and is restricted to the 2-Wasserstein dis-
tance which makes it unsuitable for a general neural network approximation of p-Wasserstein
distances between discrete distributions. This neural approximation method contrasts with
our goal: a general neural network that can approximate the p-Wasserstein distance be-
tween any two point sets in a compact metric space to within $\epsilon$-accuracy. Siamese networks are another popular approach for learning Wasserstein distances. Typically, a Siamese network is composed of a single neural network which maps two input instances to Euclidean space, and the output of the network is then given by the $\ell_p$-norm between the out-
put embeddings. In (Courty et al., 2017), the authors utilize a Siamese autoencoder which
takes two histograms (images) as input. For their architecture, a single encoder network is
utilized to map each histogram to an embedding in Euclidean space, while a decoder network
maps each Euclidean embedding back to an output histogram. The Kullback-Leibler (KL) divergence between the original histogram and the output histogram (i.e., a reconstruction loss) is used during training to regularize the embeddings, and the final estimate of the 2-Wasserstein distance is the $\ell_2$ norm between the embeddings. The idea of learning Wasserstein distances
via Siamese autoencoders was extended in (Kawano et al., 2020) to point cloud data with
the Wasserstein point cloud embedding network (WPCE) where the original KL reconstruc-
tion loss was replaced with a differentiable Wasserstein approximation between the original
point set and a fixed-size output point set from the decoder network. In our subsequent
experiments, we show that our neural network trains much more efficiently and generalizes
much better than WPCE to point sets of unseen size.
Moreover, the concept of group invariant networks was previously investigated in several
works, including (Zaheer et al., 2017; Qi et al., 2017; Maron et al., 2018; Lim et al., 2022;
Deng et al., 2021). For instance, DeepSets (Zaheer et al., 2017) and PointNet (Qi et al., 2017) are two popular permutation invariant neural networks which were shown to
be universal with respect to set functions. In addition to group invariance, there have also
been efforts to explore the notion of invariance with respect to combinations of groups, such
as invariance to both SE(3) and permutation group (Du et al., 2022; Maron et al., 2020)
or combining basis invariance with permutation invariance (Lim et al., 2022). Our work
differs from previous work in that we address a universal neural network which is invariant
with respect to a specific combination of permutation groups that corresponds to an SFGI
function on point sets. In general, we can view this as a subgroup of the permutation group: encoding the invariance of each individual point set together with the symmetry requirement of the product function corresponds to a specific subgroup of the permutation group. Thus,
previous results regarding permutation invariant architectures such as DeepSets, PointNet
or combinations of group actions (such as (Du et al., 2022)) do not address our setting of
SFGI functions or p-Wasserstein distances.
2 Preliminaries
We will begin with basic background on groups, universal approximation, and Wasserstein
distances.
Groups and group actions. A group $G$ is an algebraic structure consisting of a set of elements and a binary operation that satisfies a specific set of axioms: (1) associativity, (2) the existence of an identity element, and (3) the existence of an inverse for each element of the set. Given a metric space $(X, d_X)$, an action of the group $G$ on $X$ is a function $\alpha : G \times X \to X$ that transforms the elements of $X$ for each element $\pi \in G$. For each element $\pi \in G$, we write $\pi \cdot x$ to denote the action of $\pi$ on $x$ instead of $\alpha(\pi, x)$. For example, if $G$ is the permutation group over $[N] := \{1, 2, \ldots, N\}$ and $X = \mathbb{R}^N$, then for any $\pi \in G$, $\pi \cdot x$ represents the permutation of the coordinates of $x \in \mathbb{R}^N$ via $\pi$, i.e., given $x = (x_1, x_2, \ldots, x_N)$, $\pi \cdot x = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$. A function $f : X \to Y$ is $G$-invariant if for any $x \in X$ and any $\pi \in G$, we have $f(x) = f(\pi \cdot x)$.
Universal Approximation. Let $C(X, \mathbb{R})$ denote the set of continuous functions from a metric space $(X, d_X)$ to $\mathbb{R}$. Given two families of functions $\mathcal{F}_1$ and $\mathcal{F}_2$ where $\mathcal{F}_1 \subseteq \mathcal{F}_2$ and $\mathcal{F}_1, \mathcal{F}_2 \subseteq C(X, \mathbb{R})$, we say that $\mathcal{F}_1$ universally approximates $\mathcal{F}_2$ if for any $\epsilon > 0$ and any $f \in \mathcal{F}_2$, there is a $g \in \mathcal{F}_1$ such that $\|g - f\| < \epsilon$. Different norms on the space of functions can be used, but we will use the $L_\infty$ norm in this paper, which intuitively leads to additive pointwise error over the domain of these functions. The classic universal approximation theorem for multilayer perceptrons (MLPs) (Cybenko, 1989) states that a feedforward neural network with a single hidden layer, using certain activation functions, can approximate any continuous function to within an arbitrary additive $\epsilon$-error.
Permutation invariant neural networks for point cloud data. One of the most popular permutation invariant neural network models is the DeepSets model defined in (Zaheer et al., 2017). DeepSets is designed to handle unordered input point sets by first applying a neural network to each individual element, then using sum-pooling to generate an embedding for the input set, and finally applying a final neural network to the pooled embedding. Formally, suppose we are given a finite multiset $S = \{x : x \in \mathbb{R}^d\}$ (meaning that an element can appear multiple times, and the number of times an element occurs in $S$ is called its multiplicity). The DeepSets model is defined as
$$\mathcal{N}_{\mathrm{DeepSet}}(S) = g_{\theta_2}\Big(\sum_{x \in S} h_{\theta_1}(x)\Big)$$
where $h_{\theta_1}$ and $g_{\theta_2}$ are neural networks. DeepSets can handle input point sets of variable sizes. It was also shown to be universal with respect to continuous multiset functions.
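Below is a minimal PyTorch sketch of the DeepSets model just described; the layer widths, activations, and class name are illustrative assumptions rather than a configuration used in this paper.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Minimal DeepSets: per-element MLP h, sum-pooling, then an outer MLP g."""
    def __init__(self, in_dim, hidden_dim=64, out_dim=1):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))
        self.g = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, out_dim))

    def forward(self, S):
        # S: (num_points, in_dim), an unordered multiset of points.
        pooled = self.h(S).sum(dim=0)   # permutation-invariant sum-pooling
        return self.g(pooled)

# The output is unchanged (up to float error) under any reordering of the rows of S.
S = torch.randn(100, 3)
model = DeepSets(in_dim=3)
out1 = model(S)
out2 = model(S[torch.randperm(100)])
assert torch.allclose(out1, out2, atol=1e-4)
```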
Theorem 2.1 ((Zaheer et al., 2017; Amir et al., 2023; Dym and Gortler, 2022)). Assume the elements are from a compact set in $\mathbb{R}^k$, and the input multiset size is fixed as $N$. Let $t = 2kN + 1$. Then any continuous multiset function, represented as $f : \mathbb{R}^{k \times N} \to \mathbb{R}$ which is invariant with respect to permutations of the columns, can be approximated arbitrarily closely in the form $\rho\big(\sum_{x \in X} \phi(x)\big)$, for continuous transformations $\phi : \mathbb{R}^k \to \mathbb{R}^t$ and $\rho : \mathbb{R}^t \to \mathbb{R}$.

While universality for the case $k = 1$ was shown using symmetric polynomials, the case $k > 1$ is in fact quite subtle and the proof in (Zaheer et al., 2017) misses key details. For completeness, we provide a full proof in Appendix E.1 for the case when the output dimension of $\phi$ is $t = \binom{k+N}{k}$. It was recently shown in (Dym and Gortler, 2022; Amir et al., 2023) that the output dimension of $\phi$ can be reduced to $2kN + 1$, which is the dimension $t$ that we use in Theorem 2.1 and subsequent theorems. In both cases, whether the output dimension of $\phi$ is $t = \binom{k+N}{k}$ or $t = 2kN + 1$, Theorem 2.1 implies that to achieve universality, the required network size depends on the input point cloud size.
Wasserstein distances and approximations. Here we introduce the Wasserstein distance for discrete measures. Let $(X, d_X)$ be a metric space. For two weighted point sets $P = \{(x_i, w_i) : x_i \in X, \sum_i w_i = 1, i \in [n]\}$ and $Q = \{(x'_j, w'_j) : x'_j \in X, \sum_j w'_j = 1, j \in [m]\}$, we define the $p$-Wasserstein distance between $P$ and $Q$ as
$$W_p(P, Q) = \min_{\Pi \in \mathbb{R}^{n \times m}_+} \Big\{ \langle \Pi, D^p \rangle^{1/p} : \Pi \mathbf{1} = [w_1, \ldots, w_n],\ \Pi^T \mathbf{1} = [w'_1, \ldots, w'_m] \Big\}$$
where $D \in \mathbb{R}^{n \times m}_+$ is the distance matrix with $D_{i,j} = d_X(x_i, x'_j)$. One can think of these weighted point sets as discrete probability distributions on $(X, d_X)$. When $p = 1$, $W_1$ is also commonly known as the Earth Mover's distance (EMD). Additionally, note that when $p = \infty$, $W_p$ is the same as the Hausdorff distance between the points in $P$ and $Q$ with non-zero
weight. Computing Wasserstein distances amounts to solving a linear programming problem, which takes $O(N^3 \log N)$ time (where $N = \max\{n, m\}$). There have been a number of methods for fast approximation of Wasserstein distances, including multi-scale and hierarchical solvers (Schmitzer, 2016) and $L_1$ embeddings via quadtree algorithms (Backurs et al., 2020; Indyk and Thaper, 2003). In particular, the entropic regularization of Wasserstein distance (Cuturi, 2013), also known as the Sinkhorn distance, is often used as the standard Wasserstein distance approximation for learning tasks. Unlike the Wasserstein distance, the Sinkhorn approximation is differentiable and can be computed in approximately $O(n^2)$ time. The computation time is governed by a regularization parameter $\epsilon$; as $\epsilon$ approaches zero, the Sinkhorn distance approaches the true Wasserstein distance.
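As a concrete illustration of the two quantities discussed above, the following sketch compares the exact 1-Wasserstein distance with its Sinkhorn approximation using the POT library (Flamary et al., 2021); the point-set sizes and the regularization value 0.1 are illustrative choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
X, Y = rng.random((100, 3)), rng.random((150, 3))    # two point sets in R^3
a = np.full(100, 1.0 / 100)                          # uniform weights summing to 1
b = np.full(150, 1.0 / 150)
D = ot.dist(X, Y, metric='euclidean')                # n x m distance matrix

w1_exact = float(ot.emd2(a, b, D))                   # exact W_1 via linear programming
w1_sinkhorn = float(ot.sinkhorn2(a, b, D, reg=0.1))  # entropic (Sinkhorn) approximation

print(f"exact W_1          = {w1_exact:.4f}")
print(f"Sinkhorn (eps=0.1) = {w1_sinkhorn:.4f}")
```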
3 Learning functions between point sets
We will first present a general framework for approximating SFGI-functions and then show
how this framework along with geometric sketches of our input data enables us to define neu-
ral networks which can approximate p-Wasserstein distances with complexity independent
of input data size.
3.1 A general framework for functions on product spaces
One of the key ingredients in our approach is the introduction of what we call a sketch of the input data into a Euclidean space whose dimension is independent of the size of the input data.

Definition 3.1 (Sketch). Let $\delta > 0$, $a \in \mathbb{N}_+$, and let $G$ be a group which acts on $X$. A $(\delta, a, G)$-sketch of a metric space $(X, d_X)$ consists of a $G$-invariant continuous encoding function $h : X \to \mathbb{R}^a$ and a continuous decoding function $g : \mathbb{R}^a \to X$ such that $d_X(g \circ h(S), S) < \delta$ for any $S \in X$.
Now let $(\mathcal{X}_1, d_{\mathcal{X}_1}), \ldots, (\mathcal{X}_m, d_{\mathcal{X}_m})$ be compact metric spaces. The product space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ is still a metric space when equipped with the following natural metric induced from the metrics of the factors:
$$d_{\mathcal{X}_1 \times \cdots \times \mathcal{X}_m}\big((A_1, \ldots, A_m), (A'_1, \ldots, A'_m)\big) = d_{\mathcal{X}_1}(A_1, A'_1) + \cdots + d_{\mathcal{X}_m}(A_m, A'_m).$$
Suppose $G_i$ is a group acting on $\mathcal{X}_i$, for each $i \in [m]$. In the following result, instead of SFGI product functions, we first consider the more general case of factor-wise group invariant functions, namely functions $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ such that $f$ is uniformly continuous and invariant to the group action $G_1 \times \cdots \times G_m$.
Lemma 3.2. Suppose $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ is uniformly continuous and invariant to $G_1 \times \cdots \times G_m$. Additionally, assume that for any $\delta > 0$, $(\mathcal{X}_i, d_{\mathcal{X}_i})$ has a $(\delta, a_i, G_i)$-sketch, where $a_i$ may depend on $\delta$. Then for any $\epsilon > 0$, there are continuous $G_i$-invariant functions $\phi_i : \mathcal{X}_i \to \mathbb{R}^{a_i}$ for all $i \in [m]$ and a continuous function $\rho : \mathbb{R}^{a_1} \times \cdots \times \mathbb{R}^{a_m} \to \mathbb{R}$ such that
$$|f(A_1, A_2, \ldots, A_m) - \rho(\phi_1(A_1), \phi_2(A_2), \ldots, \phi_m(A_m))| < \epsilon$$
for any $(A_1, \ldots, A_m) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$. Furthermore, if $\mathcal{X}_1 = \mathcal{X}_2 = \cdots = \mathcal{X}_m$, then we can choose $\phi_1 = \phi_2 = \cdots = \phi_m$.
Note that a recent result from (Lim et al., 2022) shows that a continuous factor-wise group invariant function $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ can be represented (not just approximated) in the form $f(v_1, \ldots, v_m) = \rho(\phi_1(v_1), \ldots, \phi_m(v_m))$ if there exists a topological embedding from each quotient $\mathcal{X}_i / G_i$ to Euclidean space. The condition that each quotient $\mathcal{X}_i / G_i$ has a topological embedding into a fixed-dimensional Euclidean space is strong: a topological embedding requires injectivity, while in a sketch one may collapse input objects as long as, after decoding, we obtain an approximated object which is close to the input. Our result can be viewed as a relaxation of their result in that we only require our space to have an approximate fixed-dimensional embedding (i.e., our $(\delta, a, G)$-sketch).
We often consider the case where $\mathcal{X} = \mathcal{X}_1 = \cdots = \mathcal{X}_m$, i.e., $f : \mathcal{X} \times \cdots \times \mathcal{X} \to \mathbb{R}$, where $G$ is a group acting on the factor $\mathcal{X}$. Oftentimes, we require the function not only to be invariant to the actions of the group $G$ on each individual factor $\mathcal{X}$ but also to be symmetric with respect to the ordering of the inputs; by this, we mean $f(A_1, \ldots, A_m) = f(A_{\pi(1)}, \ldots, A_{\pi(m)})$ for any permutation $\pi$ on $[m]$. In other words, we now consider the SFGI product functions $f$ introduced in Definition 1.1. The extra symmetry requirement adds more constraints to the form of $f$. We show that the set of uniformly continuous SFGI product functions can be universally approximated by product functions of an even simpler form than in Lemma 3.2, as stated in the lemma below.
Lemma 3.3. Assume the same setup as Lemma 3.2 with $\mathcal{X} = \mathcal{X}_1 = \cdots = \mathcal{X}_m$ and $G = G_1 = \cdots = G_m$. Assume that $\mathcal{X}$ has a $(\delta, a, G)$-sketch. Additionally, suppose $f$ is symmetric; hence $f$ is an SFGI function. Let $t = 2am + 1$. Then for any $\epsilon > 0$, there is a continuous $G$-invariant function $\phi : \mathcal{X} \to \mathbb{R}^t$ and a continuous function $\rho : \mathbb{R}^t \to \mathbb{R}$ such that
$$\Big|f(A_1, \ldots, A_m) - \rho\Big(\sum_{i=1}^{m} \phi(A_i)\Big)\Big| < \epsilon.$$
Now suppose we want to approximate an SFGI product function $f$ with a neural network. Lemma 3.3 implies that we can approximate $\phi$ with any universal $G$-invariant neural network which embeds the original space $\mathcal{X}$ into some Euclidean space $\mathbb{R}^a$, while the outer architecture $\rho$ can be any universal architecture (e.g., an MLP). Finding a universal $G$-invariant neural network to realize $\phi$ over a single factor space $\mathcal{X}$ is in general much easier than finding an SFGI neural network, and as we discussed at the end of Section 1, we already know how to achieve this in several settings. We will show how this idea is put to work for approximating SFGI functions between point sets in the next subsection.
3.2 Universal neural networks for functions between point sets
We are interested in learning symmetric functions between point sets (i.e. any p-Wasserstein
distance) which are factor-wise permutation invariant. In this section, we will show that we
can find a (δ, a, G)-sketch for the space of weighted point sets. This allows us to combine
Lemma 3.3 with DeepSets to define a set of neural networks which can approximate p-th
Wasserstein distances to arbitrary accuracy. Furthermore, the encoding and decoding func-
tions can be approximated with neural networks where their model complexity is independent
of input point set size. Thus, the resulting neural network used to approximate Wasserstein
distance also has bounded model complexity.
Set up. Given some metric space $(\Omega, d_\Omega)$, let $\mathcal{X}$ be the set of weighted point sets with at most $N$ elements. In other words, each $S \in \mathcal{X}$ has the form $S = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in \Omega\}$ with $|S| \le N$. One can also consider $\mathcal{X}$ to be the set of weighted Dirac measures over $\Omega$. For simplicity, we sometimes also use $S$ to refer to just the set of points $\{x_i\}$ contained within it. We consider the metric over $\mathcal{X}$ to be the $p$-Wasserstein distance $W_p$, and we refer to the metric space of weighted point sets over $\Omega$ as $(\mathcal{X}, W_p)$.

First, we will show that for any $\delta > 0$, there is a $(\delta, a, G)$-sketch of $\mathcal{X}$ with respect to $W_p$. The embedding dimension $a$ depends on the so-called covering number of the metric space $(\Omega, d_\Omega)$ from which points are sampled. Given a compact metric space $(\Omega, d_\Omega)$, the covering number $\nu_\Omega(r)$ w.r.t. radius $r$ is the minimal number of radius-$r$ balls needed to cover $\Omega$. As a simple example, consider $\Omega = [-\Delta, \Delta] \subset \mathbb{R}$. Given any $r$, we can cover $\Omega$ with $\lceil \frac{2\Delta}{r} \rceil$ intervals, so $\nu_\Omega(r) \le \lceil \frac{2\Delta}{r} \rceil$. The collection of centers of a set of $r$-balls that cover $\Omega$ is called an $r$-net of $\Omega$. For a compact set $\Omega \subset \mathbb{R}^d$ with diameter $D$, its covering number $\nu_\Omega(r)$ is a constant depending on $D$, $r$ and $d$ only.
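For instance, a standard grid construction gives a concrete (non-tight) bound for a box in $\mathbb{R}^d$; the worked bound below is our own illustration, following the usual cube-diameter argument rather than anything specific to this paper.

```latex
% Covering number of \Omega = [-\Delta, \Delta]^d \subset \mathbb{R}^d (Euclidean metric).
% Tile \Omega with axis-aligned cubes of side s = 2r/\sqrt{d}; each cube has diameter
% s\sqrt{d} = 2r, so it fits inside a ball of radius r centered at the cube's center. Hence
\nu_\Omega(r) \;\le\; \left\lceil \tfrac{2\Delta}{s} \right\rceil^{d}
           \;=\; \left\lceil \tfrac{\Delta\sqrt{d}}{r} \right\rceil^{d},
% a bound independent of any point-set size N but exponential in the dimension d.
```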
Theorem 3.4. Set $d_{\mathcal{X}}$ to be $W_p$ for $1 \le p < \infty$. Let $G$ be the permutation group. For any $\delta > 0$, let $\delta_0 = \frac{1}{2}\sqrt[p]{\delta/2}$ and let $a = \nu_\Omega(\delta_0)$ be the covering number of $\Omega$ w.r.t. radius $\delta_0$. Then there is a $(\delta, a, G)$-sketch of $\mathcal{X}$ with respect to $W_p$. Furthermore, the encoding function $h : \mathcal{X} \to \mathbb{R}^a$ can be expressed as follows, where $\overline{h} : \Omega \to \mathbb{R}^a$ is continuous:
$$h(S) = \sum_{x \in S} \overline{h}(x). \qquad (1)$$
Proof. Let $\delta > 0$ and let $S \in \mathcal{X}$ with $S = \{(x_i, w_i) : \sum_i w_i = 1, x_i \in \Omega\}$ and $|S| \le N$. Given $\delta_0 = \frac{1}{2}\sqrt[p]{\delta/2}$ and $a = \nu_\Omega(\delta_0)$, we know that $\Omega$ has a $\delta_0$-net $C$, and we denote the elements of $C$ by $\{y_1, \ldots, y_a\}$. In other words, for any $x \in \Omega$, there is a $y_i \in C$ such that $d_\Omega(x, y_i) < \delta_0$.

First, we define the encoding function $h : \mathcal{X} \to \mathbb{R}^a$. For each $y_i$, we use a soft indicator function $e^{-b\, d_\Omega(x, B_{\delta_0}(y_i))}$ and set the constant $b$ so that $e^{-b\, d_\Omega(x, B_{\delta_0}(y_i))}$ is "sufficiently" small if $d_\Omega(x, B_{\delta_0}(y_i)) > \delta_0$. More formally, we know that $\lim_{b \to \infty} e^{-b\delta_0} = 0$, so there is $\beta \in \mathbb{R}$ such that for all $b > \beta$, $e^{-b\delta_0} < \frac{\delta_0^p}{d_{\max}^p \cdot a}$, where $d_{\max}$ denotes the diameter of $\Omega$. Set $b_0$ such that $b_0 > \beta$. Let $h_i(x) = e^{-b_0\, d_\Omega(x, B_{\delta_0}(y_i))}$ for each $i \in [a]$. For a given $x \in \Omega$, we compute $\overline{h} : \Omega \to \mathbb{R}^a$ as
$$\overline{h}(x) = [h_1(x), \ldots, h_a(x)].$$
Then we define the encoding function $h : \mathcal{X} \to \mathbb{R}^a$ as
$$h(S) = \sum_{i=1}^{n} w_i \frac{\overline{h}(x_i)}{\|\overline{h}(x_i)\|_1}.$$
Note that $\|h(S)\|_1 = 1$ and $h$ is continuous since Wasserstein distances metrize weak convergence. Additionally, since $d_\Omega(x, B_{\delta_0}(y_i))$ is the distance from $x$ to the $\delta_0$-ball around $y_i$, we are guaranteed to have at least one $j$ with $h_j(x_i) = 1$, so $\|\overline{h}(x_i)\|_1 \ge 1$.

Now, we define a decoding function $g : \mathbb{R}^a \to \mathcal{X}$ as $g(v) = \{(y_i, \frac{v_i}{\|v\|_1}) : i \in [a]\}$. In order to show that $g$ and $h$ yield a valid $(\delta, a, G)$-sketch of $\mathcal{X}$, we must show that $g \circ h(S)$ is sufficiently close to $S$ under the $W_p$ distance. First, we know that
$$W_p^p(g \circ h(S), S) \le \sum_{i=1}^{n} \sum_{j=1}^{a} w_i \frac{h_j(x_i)}{\|\overline{h}(x_i)\|_1}\, d_\Omega(x_i, y_j)^p.$$
For a given $x_i$, we can partition $\{h_1(x_i), \ldots, h_a(x_i)\}$ into those with $h_j(x_i) \ge \frac{\delta_0^p}{d_{\max}^p \cdot a}$ and those with $h_j(x_i) < \frac{\delta_0^p}{d_{\max}^p \cdot a}$, denoted $\{h_{j_1}(x_i), \ldots, h_{j_k}(x_i)\}$ and $\{h_{j_{k+1}}(x_i), \ldots, h_{j_a}(x_i)\}$ respectively. If $h_j(x) \ge \frac{\delta_0^p}{d_{\max}^p \cdot a}$, then
$$e^{-b_0\, d_\Omega(x, B_{\delta_0}(y_j))} \ge \frac{\delta_0^p}{d_{\max}^p \cdot a} > e^{-b_0 \delta_0},$$
so $d_\Omega(x, B_{\delta_0}(y_j)) < \delta_0$. Then
$$W_p^p(g \circ h(S), S) \le \sum_{i=1}^{n} \sum_{j=1}^{a} w_i \frac{h_j(x_i)}{\|\overline{h}(x_i)\|_1}\, d_\Omega(x_i, y_j)^p < \sum_{i=1}^{n} w_i \Bigg( \sum_{\ell=1}^{k} \frac{h_{j_\ell}(x_i)}{\|\overline{h}(x_i)\|_1}\, (2\delta_0)^p + \sum_{\ell=k+1}^{a} \frac{\delta_0^p}{d_{\max}^p \cdot a}\, d_{\max}^p \Bigg) \quad \text{(since $d_\Omega(x_i, y_j) \le d_{\max}$)}$$
$$\le \sum_{i=1}^{n} w_i \big( 2^p \delta_0^p + \delta_0^p \big) \le 2^p \cdot \frac{1}{2^p} \cdot \frac{\delta}{2} + \frac{1}{2^p} \cdot \frac{\delta}{2} < \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$
Thus, the encoding function $h$ and the decoding function $g$ make up a $(\delta, a, G)$-sketch.
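As a concrete illustration of this construction, the following NumPy sketch implements the soft one-hot encoding $h$ and decoding $g$ for point sets in the unit square, using a regular grid as the $\delta_0$-net; the grid resolution, the constant `b0`, and the function names are illustrative choices rather than values prescribed by the proof.

```python
import numpy as np

def make_net(delta0):
    """A regular grid serving as a delta_0-net of Omega = [0, 1]^2."""
    ticks = np.arange(delta0 / 2, 1.0, delta0)
    return np.array([[u, v] for u in ticks for v in ticks])       # shape (a, 2)

def encode(points, weights, net, delta0, b0=50.0):
    """h(S): weighted, row-normalized soft one-hot encoding in R^a."""
    dist_to_centers = np.linalg.norm(points[:, None, :] - net[None, :, :], axis=2)
    dist_to_balls = np.maximum(dist_to_centers - delta0, 0.0)     # d(x, B_delta0(y_i))
    H = np.exp(-b0 * dist_to_balls)                               # per-point soft indicators
    H /= H.sum(axis=1, keepdims=True)                             # divide by ||hbar(x_i)||_1
    return weights @ H                                            # sum_i w_i hbar(x_i)/||hbar(x_i)||_1

def decode(v, net):
    """g(v): a weighted point set supported on the net points."""
    return net, v / v.sum()

# Encode a random weighted point set and decode it back onto the net.
rng = np.random.default_rng(0)
pts, w = rng.random((200, 2)), np.full(200, 1.0 / 200)
net = make_net(delta0=0.1)
net_pts, net_w = decode(encode(pts, w, net, delta0=0.1), net)
```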
Note that the sketch outlined in Theorem 3.4 is a smooth version of a one-hot encoding. With Theorem 3.4 and Lemma 3.3, we will now give an explicit formulation of an $\epsilon$-approximation of $f$ via sum-pooling of continuous functions.
Corollary 3.5. Let $\epsilon > 0$, let $(\Omega, d_\Omega)$ be a compact metric space, and let $\mathcal{X}$ be the space of weighted point sets over $\Omega$ equipped with the $p$-Wasserstein distance $W_p$. Suppose for any $\delta$, $(\Omega, d_\Omega)$ has covering number $a(\delta)$. Then given a function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is uniformly continuous and permutation invariant, there are continuous functions $\overline{h} : \Omega \to \mathbb{R}^{a(\delta)}$, $\phi : \mathbb{R}^{a(\delta)} \to \mathbb{R}^{a^*}$, and $\rho : \mathbb{R}^{a^*} \to \mathbb{R}$ such that for any $A, B \in \mathcal{X}$,
$$\Big| f(A, B) - \rho\Big( \phi\Big( \sum_{(x, w_x) \in A} w_x \overline{h}(x) \Big) + \phi\Big( \sum_{(x, w_x) \in B} w_x \overline{h}(x) \Big) \Big) \Big| < \epsilon$$
where $\overline{h}$, $\phi$ and $\rho$ are all continuous and $a^* = 4 \cdot a(\delta) + 1$.
Due to Eqn. (1), instead of considering the function $h$ which takes a set of points $S \in \mathcal{X}$ as input, we now only need to model the function $\overline{h} : \Omega \to \mathbb{R}^a$, which takes a single point $x \in S$ as input. For simplicity, assume that the input metric space $(\Omega, d_\Omega)$ is a compact set in some Euclidean space $\mathbb{R}^d$. Note that in contrast to Lemma 3.3, each of $\overline{h}$, $\phi$ and $\rho$ is simply a continuous function, and there is no further group invariance requirement. Furthermore, all the dimensions of the domains and ranges of these functions are bounded values that depend only on the covering number of $\Omega$ and the target additive error $\epsilon$, and are independent of the maximum size $N$ of the input point sets. We can use multilayer perceptrons (MLPs) $h_{\theta_1}$, $\phi_{\theta_2}$, and $\rho_{\theta_3}$ in place of $\overline{h}$, $\phi$ and $\rho$ to approximate the desired function. Formally, we define the following family of neural networks:
$$\mathcal{N}_{\mathrm{ProductNet}}(A, B) = \rho_{\theta_3}\Big( \phi_{\theta_2}\Big( \sum_{(x, w_x) \in A} w_x h_{\theta_1}(x) \Big) + \phi_{\theta_2}\Big( \sum_{(x, w_x) \in B} w_x h_{\theta_1}(x) \Big) \Big). \qquad (2)$$
In practice, we consider the input to the neural network $h_{\theta_1}$ to be a point $x$ along with its weight $w_x$. As per the discussion above, functions represented by $\mathcal{N}_{\mathrm{ProductNet}}$ can universally approximate SFGI product functions on the space of point sets. See Figure 1 for an illustration of our universal architecture for approximating product functions on point sets. As the $p$-Wasserstein distances $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are uniformly continuous with respect to the underlying metric $W_p$, we can apply our framework to the problem of approximating the $p$-Wasserstein distance.
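The following is a minimal PyTorch sketch of the $\mathcal{N}_{\mathrm{ProductNet}}$ family in Eqn. (2); the layer widths and the choice to concatenate each point with its weight before $h_{\theta_1}$ are illustrative assumptions, not the exact configuration used in the experiments of Section 4.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class ProductNet(nn.Module):
    """N_ProductNet(A, B) = rho(phi(sum_A w_x h(x)) + phi(sum_B w_x h(x)))."""
    def __init__(self, point_dim, sketch_dim=64, latent_dim=128):
        super().__init__()
        self.h = mlp([point_dim + 1, 128, sketch_dim])   # per-point encoder on (point, weight)
        self.phi = mlp([sketch_dim, 128, latent_dim])    # inner map on the pooled sketch
        self.rho = mlp([latent_dim, 128, 1])             # outer map on the symmetric sum

    def embed(self, points, weights):
        # points: (n, d); weights: (n,) summing to 1.
        per_point = self.h(torch.cat([points, weights[:, None]], dim=1))
        return self.phi((weights[:, None] * per_point).sum(dim=0))

    def forward(self, A, B):
        (pa, wa), (pb, wb) = A, B
        return self.rho(self.embed(pa, wa) + self.embed(pb, wb))  # symmetric in (A, B)

# The prediction is invariant to permuting points within A or B and to swapping A and B.
A = (torch.rand(120, 3), torch.full((120,), 1 / 120))
B = (torch.rand(80, 3), torch.full((80,), 1 / 80))
model = ProductNet(point_dim=3)
print(model(A, B).item(), model(B, A).item())
```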
Importantly, the number of parameters in $\mathcal{N}_{\mathrm{ProductNet}}$ does not depend on the maximum size of the point sets but rather only on the additive error $\epsilon$ and, by extension, the covering number of the original metric space. This is because the encoding function for our sketch is defined as a summation of per-point terms in $\mathbb{R}^a$, where $a$ is independent of the size of the input set. Contrast this with the latent dimension in the universality statement of DeepSets (cf. Theorem 2.1), which depends on the input point set size. Note that in general, the model complexities of the MLPs $\rho_{\theta_3}$, $\phi_{\theta_2}$, and $h_{\theta_1}$ depend on the dimensions of the domain and co-domain of each function they approximate ($\rho$, $\phi$ and $\overline{h}$) and the desired approximation error $\epsilon$. We assume that the MLPs are Lipschitz continuous. In our case, $\phi_{\theta_2}$ operates on the sum of $h_{\theta_1}(x)$'s over all (up to $N$) input points in $A$ (or in $B$). In general, the errors made in the $\overline{h}$'s may accumulate $N$ times, which would force the precision we must achieve for each individual $h_{\theta_1}(x)$ (compared to $\overline{h}(x)$) to be $\Theta(\epsilon/N)$; this would cause the model complexity of $h_{\theta_1}$ to depend on $N$. Fortunately, this is not the case for our encoding function $h$. In particular, because our encoding function can be written as a normalized sum over individual points in the set $S$, i.e., $\phi_{\theta_2}$ operates on $\sum_{(x, w_x) \in S} w_x h_{\theta_1}(x)$ with $\sum_x w_x = 1$, the error accumulated by the normalized sum is at most the maximum error made on any single point $\overline{h}(x)$ for $x \in S$. Thus, as both the error required of each MLP and the dimensions of the domain and co-domain of each approximated function ($\rho$, $\phi$, $\overline{h}$) do not depend on the size $N$ of the input point sets, we get that $\rho_{\theta_3}$, $\phi_{\theta_2}$ and $h_{\theta_1}$ each have model complexity independent of the size of the input point sets. In short, because the encoding and decoding functions of our sketch can be approximated by neural networks with model complexity independent of the input point set size, we are able to obtain a low model complexity neural network which can approximate the Wasserstein distance arbitrarily well. Note that in general, we are not guaranteed to be able to find a sketch which can be approximated by neural networks whose size is independent of the input size. Finally, this neural network architecture can also be applied to approximating the Hausdorff distance; more details regarding Hausdorff distance approximation are available in Appendix A. We summarize the above results in the following corollary.
Corollary 3.6. There is a network of the form $\mathcal{N}_{\mathrm{ProductNet}}$ which can approximate the $p$-Wasserstein distance between point sets to within an additive $\epsilon$-error. The number of parameters in this network depends on $\epsilon$ and the covering number of $\Omega$, and not on the size of each point set.

Additionally, if we replace the sum-pooling with max-pooling in $\mathcal{N}_{\mathrm{ProductNet}}$, there is a network of such a form which can also approximate the Hausdorff distance between point sets to additive $\epsilon$ accuracy.
Figure 1: Visually representing a neural network which can universally approximate uniformly continuous SFGI product functions over pairs of point sets. (The diagram depicts a Siamese network applied to set elements, Euclidean embeddings obtained by sum-pooling over set elements, a Siamese network applied to the embeddings, and a final sum-pooling followed by an MLP.)
Exponential dependence on dimension. Although the model complexity of $\mathcal{N}_{\mathrm{ProductNet}}$ is independent of the size of the input point sets, it depends on the covering number of $\Omega$, which, in turn, can have an exponential dependence on the dimension of $\Omega$. In short, this means that the model complexity of $\mathcal{N}_{\mathrm{ProductNet}}$ has an exponential dependence on the dimension of $\Omega$. However, in practice, many machine learning tasks (e.g., 3D point cloud processing) involve large point sets sampled from a low-dimensional space ($d = 3$). Furthermore, in general, the covering number of $\Omega$ depends on the intrinsic dimension of $\Omega$ rather than the ambient dimension. For instance, if the input point sets are sampled from a hidden manifold of dimension $d'$ (which is much lower than the ambient dimension $d$), then the covering number would depend only on $d'$ and the curvature bound of the manifold. In many modern machine learning applications, it is often assumed that the data is sampled from a hidden space of low dimension (the manifold hypothesis), even though the ambient dimension might be very high.
Using max-pooling instead of sum-pooling. Observe that in the final step of combining the Euclidean outputs for two point sets $A, B \in \mathcal{X}$, namely $\sum_{(x, w_x) \in A} w_x h_{\theta_1}(x)$ and $\sum_{(x, w_x) \in B} w_x h_{\theta_1}(x)$, we use the sum of these two components (as in a DeepSets architecture) to ensure the symmetry condition of SFGI product functions. One could replace this final sum with a final max, as in PointNet. However, to show that PointNets are able to universally approximate continuous functions $F : K \to \mathbb{R}$ where $K \subset \mathbb{R}^{a \times 2}$ is compact, we need to use a $\delta$-net of $K$, whose size also serves as the intermediate dimension for $\phi$. As $K \subseteq [0, N]^{a \times 2}$ in our case (where $N$ is the maximum cardinality of a point set), the intermediate dimension for a max-pooling invariant architecture at the end (i.e., PointNet) now depends on the maximum size of the input point sets.
4 Experimental results
We evaluate the accuracy of the 1-Wasserstein distance approximations of our proposed neural network architecture, $\mathcal{N}_{\mathrm{ProductNet}}$, against two different baseline architectures: (1) a Siamese autoencoder known as the Wasserstein Point Cloud Embedding network (WPCE)
Table 1: Mean relative error between approximations and ground truth Wasserstein distance between point sets. The top row for each dataset shows the approximation quality for point sets with input sizes that were seen at training time, while the bottom row shows the approximation quality for point sets with input sizes that were not seen at training time. Note that $\mathcal{N}_{\mathrm{ProductNet}}$ is our model.
Dataset | Input size | N_ProductNet | WPCE | N_SDeepSets | Sinkhorn
noisy-sphere-3 | [100, 300] | 0.046 ± 0.043 | 0.341 ± 0.202 | 0.362 ± 0.241 | 0.187 ± 0.232
noisy-sphere-3 | [300, 500] | 0.158 ± 0.198 | 0.356 ± 0.286 | 0.608 ± 0.431 | 0.241 ± 0.325
noisy-sphere-6 | [100, 300] | 0.015 ± 0.014 | 0.269 ± 0.285 | 0.291 ± 0.316 | 0.137 ± 0.122
noisy-sphere-6 | [300, 500] | 0.049 ± 0.054 | 0.423 ± 0.408 | 0.508 ± 0.473 | 0.198 ± 0.181
uniform | 256 | 0.097 ± 0.073 | 0.120 ± 0.103 | 0.123 ± 0.092 | 0.073 ± 0.009
uniform | [200, 300] | 0.131 ± 0.096 | 1.712 ± 0.869 | 0.917 ± 0.869 | 0.064 ± 0.008
ModelNet-small | [20, 200] | 0.084 ± 0.077 | 0.077 ± 0.075 | 0.105 ± 0.096 | 0.101 ± 0.032
ModelNet-small | [300, 500] | 0.111 ± 0.086 | 0.241 ± 0.198 | 0.261 ± 0.245 | 0.193 ± 0.155
ModelNet-large | 2048 | 0.140 ± 0.206 | 0.159 ± 0.141 | 0.166 ± 0.129 | 0.148 ± 0.048
ModelNet-large | [1800, 2000] | 0.162 ± 0.228 | 0.392 ± 0.378 | 0.470 ± 0.628 | 0.188 ± 0.088
RNAseq | [20, 200] | 0.012 ± 0.010 | 0.477 ± 0.281 | 0.482 ± 0.291 | 0.040 ± 0.009
RNAseq | [300, 500] | 0.292 ± 0.041 | 0.583 ± 0.309 | 0.575 ± 0.302 | 0.048 ± 0.006
(Kawano et al., 2020) (previously introduced at the end of Section 1; this is a SOTA method for neural approximation of Wasserstein distance), and (2) a Siamese DeepSets network, denoted $\mathcal{N}_{\mathrm{SDeepSets}}$, which is a single DeepSets model that maps both point sets to a Euclidean space and approximates the 1-Wasserstein distance as the $\ell_2$ norm between the outputs for the two point sets. As Siamese networks are widely employed for metric learning, the $\mathcal{N}_{\mathrm{SDeepSets}}$ model is a natural baseline for comparison against $\mathcal{N}_{\mathrm{ProductNet}}$. We additionally test our neural network approximations against the Sinkhorn distance where the regularization parameter was set to $\epsilon = 0.1$. For each model, we record results for the best model according to a hyperparameter search over the parameters of each model. Finally, we use the ModelNet40 (Wu et al., 2015) dataset, which consists of point clouds in $\mathbb{R}^3$, and a gene expression dataset (RNAseq) which consists of 4360 cells each represented by 2000 genes (i.e., 4360 points in $\mathbb{R}^{2000}$), as well as three synthetic datasets: (1) uniform, where point sets are in $\mathbb{R}^2$; (2) noisy-sphere-3, where point sets are in $\mathbb{R}^3$; and (3) noisy-sphere-6, where point sets are in $\mathbb{R}^6$. The RNAseq dataset is publicly available courtesy of the Allen Institute (Yao et al., 2021). Additional details and experiments approximating the 2-Wasserstein distance are available in Appendix D.
Approximating Wasserstein distances. Our results comparing 1-Wasserstein distance approximations are summarized in Table 1. Additionally, see Table 5 for a summary of the time needed for training. For most datasets, $\mathcal{N}_{\mathrm{ProductNet}}$ produces more accurate approximations of Wasserstein distances, both for input point sets of sizes seen at training time and for input point sets of sizes unseen at training time.
Table 2: Training time and number of epochs needed for convergence for the best model.

Dataset | | N_ProductNet | WPCE | N_SDeepSets
noisy-sphere-3 | Time | 6min | 1hr 46min | 9min
noisy-sphere-3 | Epochs | 20 | 100 | 100
noisy-sphere-6 | Time | 12min | 4hr 6min | 1hr 38min
noisy-sphere-6 | Epochs | 20 | 100 | 100
uniform | Time | 7min | 3hr 36min | 1hr 27min
uniform | Epochs | 23 | 100 | 100
ModelNet-small | Time | 7min | 1hr 23min | 12min
ModelNet-small | Epochs | 20 | 100 | 100
ModelNet-large | Time | 8min | 3hr 5min | 40min
ModelNet-large | Epochs | 20 | 100 | 100
RNAseq (2k) | Time | 15min | 14hr 26min | 3hr 01min
RNAseq (2k) | Epochs | 73 | 100 | 100
For the high-dimensional RNAseq dataset, our approximation remains accurate in comparison with other methods, including the standard Sinkhorn approximation, for input point sets of sizes seen at training time. The only exception is ModelNet-small, where the $\mathcal{N}_{\mathrm{ProductNet}}$ approximation error is slightly larger than that of WPCE for input point set sizes used during training (top row for each dataset in Table 1). However, for point sets whose input sizes were not used during training (bottom row for each dataset in Table 1), $\mathcal{N}_{\mathrm{ProductNet}}$ showed significantly lower error than all other methods including WPCE. These results, along with a more detailed plot in Figure 3 in Appendix D, indicate that $\mathcal{N}_{\mathrm{ProductNet}}$ generalizes better than WPCE to point sets of input sizes unseen at training time. Also, see Appendix D for additional discussion about generalization. Furthermore, one major advantage of $\mathcal{N}_{\mathrm{ProductNet}}$ over WPCE is the dramatically reduced time needed for training (cf. Table 2). This substantial difference in training time is due to WPCE's use of the Sinkhorn reconstruction loss, as the $O(n^2)$ computation time for the Sinkhorn distance can be prohibitively expensive as input point set sizes grow. Thus, our results indicate that, compared to WPCE, $\mathcal{N}_{\mathrm{ProductNet}}$ can reduce training time while still achieving comparable or better quality approximations of Wasserstein distance. Using $\mathcal{N}_{\mathrm{ProductNet}}$, we can produce high quality approximations of the 1-Wasserstein distance while avoiding the extra cost associated with using an autoencoder architecture and Sinkhorn regularization. Finally, all models produce much faster approximations than the Sinkhorn distance (see Tables 5 and 8 in Appendix D). In summary, as compared to WPCE, our model is more accurate in approximating the 1-Wasserstein distance, generalizes better to larger input point set sizes, and is more efficient in terms of training time.
5 Concluding Remarks
Our work presents a general neural network framework for approximating SFGI functions, which can be combined with geometric sketching ideas into a specific and efficient neural network for approximating p-Wasserstein distances. We intend to utilize $\mathcal{N}_{\mathrm{ProductNet}}$ as an accurate, efficient, and differentiable approximation for Wasserstein distance in downstream machine learning tasks where Wasserstein distance is employed, such as loss functions for aligning single cell multi-omics data (Demetci et al., 2020) or compressing energy profiles in high energy particle colliders (Di Guglielmo et al., 2021; Shenoy and The CMS Collaboration Team, 2022). Beyond Wasserstein distance, we will look to apply our framework to a wide array of geometric problems that can be considered SFGI functions and are desirable to approximate via neural networks. For instance, consider the problems of computing the optimal Wasserstein distance under rigid transformations or the Gromov-Wasserstein distance, both of which can be represented as SFGI functions where the factor-wise group invariances include both permutations and rigid transformations. In that case, our sketch must be invariant to both permutations and orthogonal group actions on the left. It remains to be seen whether there is a neural network architecture which can approximate such an SFGI function to within an arbitrary additive $\epsilon$-error with complexity that does not depend on the maximum size of the input sets.
6 Acknowledgements
The authors thank anonymous reviewers for their helpful comments. Furthermore, the au-
thors thank Rohan Gala and the Allen Institute for generously providing the RNAseq dataset.
Finally, Samantha Chen would like to thank Puoya Tabaghi for many helpful discussions
about permutation invariant neural networks and Tristan Brugère for his implementation of
the Sinkhorn algorithm. This research is supported by National Science Foundation (NSF)
grants CCF-2112665 and CCF-2217058 and National Institutes of Health (NIH) grant RF1
MH125317.
References
Tal Amir, Steven J Gortler, Ilai Avni, Ravina Ravina, and Nadav Dym. Neural injective
functions for multisets, measures and graphs via a finite witness theorem. arXiv preprint
arXiv:2306.06529, 2023.
Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Scalable nearest
neighbor search for optimal transport. In International Conference on machine learning,
pages 497–506. PMLR, 2020.
Pietro Bongini, Monica Bianchini, and Franco Scarselli. Molecular generative graph neural
networks for drug discovery. Neurocomputing, 450:242–252, 2021.
Emmanuel Briand. When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials? Beiträge zur Algebra und Geometrie: Contributions to Algebra and Geometry, 45(2):353–368, 2004.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "Siamese" time delay neural network. Advances in neural information processing systems, 6, 1993.
Christian Bueno and Alan Hylton. On the representation power of set pooling networks.
Advances in Neural Information Processing Systems, 34:17170–17182, 2021.
Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. Learning wasserstein embeddings.
arXiv preprint arXiv:1710.07457, 2017.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances
in neural information processing systems, 26, 2013.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of
control, signals and systems, 2(4):303–314, 1989.
Pinar Demetci, Rebecca Santorella, Björn Sandstede, William Stafford Noble, and Ritamb-
hara Singh. Gromov-wasserstein optimal transport to align single-cell multi-omics data.
BioRxiv, pages 2020–04, 2020.
Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and
Leonidas J Guibas. Vector neurons: A general framework for so (3)-equivariant net-
works. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 12200–12209, 2021.
Giuseppe Di Guglielmo, Farah Fahim, Christian Herwig, Manuel Blanco Valentin, Javier
Duarte, Cristian Gingu, Philip Harris, James Hirschauer, Martin Kwok, Vladimir Loncar,
et al. A reconfigurable neural network asic for detector front-end data compression at the
hl-lhc. IEEE Transactions on Nuclear Science, 68(8):2179–2186, 2021.
Weitao Du, He Zhang, Yuanqi Du, Qi Meng, Wei Chen, Nanning Zheng, Bin Shao, and
Tie-Yan Liu. Se (3) equivariant graph neural networks with complete local frames. In
International Conference on Machine Learning, pages 5583–5608. PMLR, 2022.
Nadav Dym and Steven J Gortler. Low dimensional invariant embeddings for universal
geometric learning. arXiv preprint arXiv:2205.02956, 2022.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn
divergences. In The 22nd International Conference on Artificial Intelligence and Statistics,
pages 2681–2690, 2019.
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo
Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko,
Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard,
Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine
Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl.
Neural message passing for quantum chemistry. In International conference on machine
learning, pages 1263–1272. PMLR, 2017.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an
invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In 3rd international
workshop on statistical and computational theories of vision, volume 2, page 5, 2003.
Keisuke Kawano, Satoshi Koide, and Takuro Kutsuna. Learning wasserstein isometric em-
bedding for point clouds. In 2020 International Conference on 3D Vision (3DV), pages
473–482. IEEE, 2020.
Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural net-
works. Advances in Neural Information Processing Systems, 32, 2019.
Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for
one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and
Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation
learning. arXiv preprint arXiv:2202.13013, 2022.
Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport
mapping via input convex neural networks. In International Conference on Machine
Learning, pages 6672–6681. PMLR, 2020.
Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant
graph networks. arXiv preprint arXiv:1812.09902, 2018.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of
invariant networks. In International conference on machine learning, pages 4363–4371.
PMLR, 2019.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. On learning sets of symmetric
elements. In International conference on machine learning, pages 6734–6744. PMLR, 2020.
Assaf Naor and Gideon Schechtman. Planar earthmover is not in $L_1$. SIAM Journal on
Computing, 37(3):804–826, 2007.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im-
perative style, high-performance deep learning library. Advances in neural information
processing systems, 32, 2019.
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on
point sets for 3d classification and segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 652–660, 2017.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Mon-
fardini. The graph neural network model. IEEE transactions on neural networks, 20(1):
61–80, 2008.
Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of
Mathematical Imaging and Vision, 56(2):238–259, 2016.
Nimrod Segol and Yaron Lipman. On universal equivariant set networks. arXiv preprint
arXiv:1910.02421, 2019.
Rohan Shenoy and The CMS Collaboration Team. EMD Neural Network for HGCAL Data
Compression Encoder ASIC. In APS April Meeting Abstracts, volume 2022 of APS Meeting
Abstracts, page Q09.008, January 2022.
Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex po-
tentials with application to improved training for gans. arXiv preprint arXiv:1902.07197,
2019.
Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung.
Revisiting point cloud classification: A new benchmark dataset and classification model
on real-world data. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 1588–1597, 2019.
Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne.
On the limitations of representing functions on sets. In International Conference on Ma-
chine Learning, pages 6487–6494. PMLR, 2019.
Edward Wagstaff, Fabian B Fuchs, Martin Engelcke, Michael A Osborne, and Ingmar Posner.
Universal approximation of functions on sets. Journal of Machine Learning Research, 23
(151):1–56, 2022.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and
Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
Zizhen Yao, Cindy TJ van Velthoven, Thuc Nghi Nguyen, Jeff Goldy, Adriana E Sedeno-
Cortes, Fahimeh Baftizadeh, Darren Bertagnolli, Tamara Casper, Megan Chiang, Kirsten
Crichton, et al. A taxonomy of transcriptomic cell types across the isocortex and hip-
pocampal formation. Cell, 184(12):3222–3241, 2021.
Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. Con-
structive Approximation, 55(1):407–474, 2022.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhut-
dinov, and Alexander J Smola. Deep sets. Advances in neural information processing
systems, 30, 2017.
A Approximating Hausdorff distance
Suppose the underlying metric for $\mathcal{X}$ (the space of weighted point sets) is the Hausdorff distance $d_H$. To clarify, given $S_1 = \{(x_i, w_i) : x_i \in \mathbb{R}^d, \sum_{i=1}^{n} w_i = 1\}$ and $S_2 = \{(y_j, w'_j) : y_j \in \mathbb{R}^d, \sum_{j=1}^{m} w'_j = 1\}$,
$$d_H(S_1, S_2) = d_H(\{x_1, \ldots, x_n\}, \{y_1, \ldots, y_m\}).$$
To get an encoding/decoding for $\mathcal{X}$ equipped with the Hausdorff distance using max-pooling, we will use the construction given in the proof of universality in (Qi et al., 2017), which we briefly recap here. As in Theorem 3.4, we consider a $\delta$-net $\{y_1, \ldots, y_a\}$ of our original space $\Omega$. For each $y_i$, we again let $h_i(x) = e^{-d_\Omega(x, B_\delta(y_i))}$ and define $\overline{h} : \Omega \to \mathbb{R}^a$ as $\overline{h}(x) = [h_1(x), \ldots, h_a(x)]$. However, instead of using sum-pooling to define $h : \mathcal{X} \to \mathbb{R}^a$, we use max-pooling:
$$h(S) = \max_{x \in S} \overline{h}(x)$$
where $\max$ is the element-wise maximum. Then the decoding function $g : \mathbb{R}^a \to \mathcal{X}$ is defined as
$$g(v) = \Big\{ \Big(y_i, \frac{v_i}{\|v\|_1}\Big) : v_i \ge 1 \Big\}.$$
Note that $v_i \ge 1$ when there is an $x_i \in S$ such that $x_i \in B_\delta(y_i)$. So for each $x_i \in S$ there is a $y_j \in g \circ h(S)$ such that $d_\Omega(x_i, y_j) < \delta$. Thus, $d_H(S, g \circ h(S)) < \delta$.
The final network for approximating uniformly continuous functions with respect to the Hausdorff distance uses max-pooling:
$$\mathcal{N}^{\max}_{\mathrm{ProductNet}}(A, B) = \rho_{\theta_3}\Big( \phi_{\theta_2}\big( \max_{x \in A} h_{\theta_1}(x) \big) + \phi_{\theta_2}\big( \max_{y \in B} h_{\theta_1}(y) \big) \Big).$$
Note that $\mathcal{N}^{\max}_{\mathrm{ProductNet}}$ is able to approximate the Hausdorff distance to $\epsilon$-accuracy.
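A minimal PyTorch sketch of this max-pooling variant is given below; weights are ignored since the Hausdorff distance depends only on the supports, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MaxProductNet(nn.Module):
    """rho(phi(max_{x in A} h(x)) + phi(max_{y in B} h(y))) with element-wise max-pooling."""
    def __init__(self, point_dim, sketch_dim=64, latent_dim=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(point_dim, 128), nn.ReLU(), nn.Linear(128, sketch_dim))
        self.phi = nn.Sequential(nn.Linear(sketch_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.rho = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def embed(self, points):
        # Element-wise max over the set replaces the weighted sum used for W_p.
        return self.phi(self.h(points).max(dim=0).values)

    def forward(self, A, B):
        return self.rho(self.embed(A) + self.embed(B))

# Estimate a Hausdorff-style distance between two (unweighted) point sets in R^3.
model = MaxProductNet(point_dim=3)
print(model(torch.rand(50, 3), torch.rand(70, 3)).item())
```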
B Comparison with other neural network architectures for approximating Wasserstein distance
Through the lens of 1-Wasserstein approximation, we compare the power of $\mathcal{N}_{\mathrm{ProductNet}}$ with other natural architectures, namely standard MLPs and Siamese networks. The standard MLP and Siamese networks are natural points of comparison for $\mathcal{N}_{\mathrm{ProductNet}}$ as (1) MLPs are the go-to choice in deep learning due to their versatility and efficacy across diverse problem domains, and (2) Siamese networks have gained significant prominence in the context of metric learning, frequently employed for tasks aimed at learning effective similarity metrics between pairs of inputs (Hadsell et al., 2006; Bromley et al., 1993; Koch et al., 2015). However, we discuss the limitations of both the standard MLP and Siamese networks for the task of learning Wasserstein distances. For this setting, we let $(X, d_X)$ be some compact Euclidean space, i.e., $X \subset \mathbb{R}^d$ is compact and $d_X$ is some $\ell_p$ norm. Then let $\mathcal{X}$ be the set of weighted point sets of cardinality at most $N$, i.e., sets of the form $S = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in X\}$.
Comparison with multilayer perceptrons (MLPs). First, we consider a standard MLP model which takes a single tensor as input. Note that for each $S \in \mathcal{X}$, there is an $\tilde{S} \in \mathcal{X}$ such that
$$\tilde{S} = \{(x_i, w_i) : (x_i, w_i) \in S\} \cup \{(x_1, 0)^{\times (N - |S|)}\}$$
where $(x_1, 0)^{\times (N - |S|)}$ means that we repeat the element $(x_1, 0)$ exactly $N - |S|$ times. Notice that $W_p(S, \tilde{S}) = 0$ and for any $S' \in \mathcal{X}$, $W_p(S, S') = W_p(\tilde{S}, S')$. If $|S| = N$, then $\tilde{S} = S$. To use an MLP model to approximate the Wasserstein distance, we want to represent any $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$ as an element of $\mathbb{R}^{(d+1) \times 2N}$, where $S_1 = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in X\}$ and $S_2 = \{(x'_j, w'_j) : w'_j \in \mathbb{R}_+, \sum_j w'_j = 1, x'_j \in X\}$. Informally, given an "empty" element of $\mathbb{R}^{(d+1) \times 2N}$, we map $S_1$ to the first $N$ columns and map $S_2$ to the last $N$ columns. Formally, given $S_1$ and $S_2$ as defined above, we define the mapping $\beta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{(d+1) \times 2N}$ as
$$\beta(S_1, S_2) = \beta(\tilde{S}_1, \tilde{S}_2) = \begin{pmatrix} x_1 & \cdots & x_N & x'_1 & \cdots & x'_N \\ w_1 & \cdots & w_N & w'_1 & \cdots & w'_N \end{pmatrix}$$
where we abuse notation slightly and use $(x_i, w_i)$ and $(x'_j, w'_j)$ to describe the elements of $\tilde{S}_1$ and $\tilde{S}_2$ respectively. Given $M \in \mathbb{R}^{(d+1) \times 2N}$, define $\beta^{-1} : \mathrm{Image}(\beta) \to \mathcal{X} \times \mathcal{X}$ as
$$\beta^{-1}\begin{pmatrix} x_1 & \cdots & x_N & x'_1 & \cdots & x'_N \\ w_1 & \cdots & w_N & w'_1 & \cdots & w'_N \end{pmatrix} = \big( \{(x_i, w_i)\}, \{(x'_j, w'_j)\} \big).$$
Now, in order to use the classical universal approximation theorem to approximate $W_p$ to within an additive $\epsilon$-error, we must show that there is a continuous function $f : \mathrm{Image}(\beta) \to \mathbb{R}$ such that $f(\beta(S_1, S_2))$ approximates $W_p(S_1, S_2)$.
Lemma B.1. Given $\beta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{(d+1) \times 2N}$ as defined previously, for any $\epsilon > 0$ there is a continuous function $f : \mathrm{Image}(\beta) \to \mathbb{R}$ such that for any $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$,
$$|W_p(S_1, S_2) - f(\beta(S_1, S_2))| < \epsilon.$$
Proof. Let $\epsilon > 0$. From Corollary 3.5, there is a $\delta > 0$ such that there are continuous functions $\overline{h} : X \to \mathbb{R}^{a(\delta)}$, $\phi : \mathbb{R}^{a(\delta)} \to \mathbb{R}^{a^*}$, and $\rho : \mathbb{R}^{a^*} \to \mathbb{R}$ such that for any $A, B \in \mathcal{X}$,
$$\Big| W_p(A, B) - \rho\Big( \phi\Big( \sum_{(x, w_x) \in A} w_x \overline{h}(x) \Big) + \phi\Big( \sum_{(x, w_x) \in B} w_x \overline{h}(x) \Big) \Big) \Big| < \epsilon.$$
Let $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$ and let $M = \beta(S_1, S_2)$. Note that $\beta^{-1}(M) = (S_1, S_2)$. For $M \in \mathrm{Image}(\beta)$, let $M[:, i]$ denote the $i$-th column vector of $M$, let $M[:d, i]$ denote the vector of the first $d$ entries of the $i$-th column of $M$, and let $M[i, j]$ denote the $(i, j)$-th entry of $M$. We know that $M[:, i] \in \mathbb{R}^{d+1}$. Since $M \in \mathrm{Image}(\beta)$, we know that $M[:d, i] \in X$ and $M[d, i]$ corresponds to a weight. Furthermore, since $\overline{h}$ is a continuous mapping from $X$ to $\mathbb{R}^{a(\delta)}$, the map $M \mapsto M[d, i]\, \overline{h}(M[:d, i])$ is continuous. Thus, since $\phi$ is also continuous, the function $f' : \mathrm{Image}(\beta) \to \mathbb{R}^{a^*}$ given by
$$f'(M) = \phi\Big( \sum_{i=1}^{N} M[d, i]\, \overline{h}(M[:d, i]) \Big) + \phi\Big( \sum_{i=N+1}^{2N} M[d, i]\, \overline{h}(M[:d, i]) \Big)$$
is continuous. Since $\rho$ is also continuous, we get that $\rho \circ f' : \mathrm{Image}(\beta) \to \mathbb{R}$ is continuous. Thus, $f : \mathrm{Image}(\beta) \to \mathbb{R}$ defined as $f = \rho \circ f'$ is also continuous and $|f(\beta(S_1, S_2)) - W_p(S_1, S_2)| < \epsilon$ by the property from Corollary 3.5.
Since $f$ in the above lemma operates on a fixed-dimensional space, we can now use an MLP to approximate it. Then, by the above lemma and the universal approximation property of MLPs, there is an MLP which can approximate $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to an arbitrary additive $\epsilon$-error. However, this approach has the following issues. While an MLP can approximate $W_p$ to an arbitrary additive $\epsilon$-error, the model complexity of the MLP depends not only on the approximation error, but also on the maximum size $N$ of the input point sets. In contrast, our neural network $\mathcal{N}_{\mathrm{ProductNet}}$ introduced in Eqn. (2) has model complexity which is independent of $N$; see Corollary 3.6. Furthermore, the function represented by the MLP (which we use to approximate $f$) may not be symmetric in the two point sets, and may not guarantee permutation invariance with respect to $G \times G$. In practice, computationally expensive techniques such as data augmentation are often required in order for an unconstrained MLP to approximate a structured function such as our Wasserstein distance function, which is an SFGI function.
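For concreteness, the following NumPy sketch implements the padding map $\beta$ described above; repeating the first point with zero weight mirrors the construction of $\tilde{S}$, and the function names are ours.

```python
import numpy as np

def pad_set(points, weights, N):
    """Build S~: pad to exactly N elements by repeating the first point with weight 0."""
    k = len(points)
    pad_pts = np.repeat(points[:1], N - k, axis=0)
    pad_w = np.zeros(N - k)
    return np.vstack([points, pad_pts]), np.concatenate([weights, pad_w])

def beta(S1, S2, N):
    """beta(S1, S2): a (d+1) x 2N array whose columns are (point; weight) pairs of S1~ then S2~."""
    blocks = []
    for pts, w in (pad_set(*S1, N), pad_set(*S2, N)):
        blocks.append(np.vstack([pts.T, w[None, :]]))   # (d+1, N) block per set
    return np.hstack(blocks)                            # (d+1, 2N)

# Two point sets in R^3 with at most N = 10 points each.
S1 = (np.random.rand(4, 3), np.full(4, 0.25))
S2 = (np.random.rand(7, 3), np.full(7, 1 / 7))
print(beta(S1, S2, N=10).shape)  # (4, 20) since d = 3
```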
Comparison with Siamese architectures. As mentioned previously, one common way of learning product functions is to use a Siamese network which embeds $\mathcal{X}$ into Euclidean space and then uses some $\ell_q$ norm to approximate the desired product function. Consider the learning of Wasserstein distances again: to approximate $W_p(A, B)$ for two point sets $A$ and $B$, the Siamese network first embeds each point set into a fixed-dimensional Euclidean space $\mathbb{R}^D$ via a function $\phi_\theta$ modeled by a neural network, and then computes $\|\phi_\theta(A) - \phi_\theta(B)\|_q$ for some $1 \le q < \infty$. Intuitively, compared to our neural network $\mathcal{N}_{\mathrm{ProductNet}}$ introduced in Eqn. (2) (recall Figure 1), the Siamese network replaces the outer function $\rho$ by simply the $L_q$-norm of $\phi_\theta(A) - \phi_\theta(B)$. However, this approach requires that one can find a near-isometric embedding of the Wasserstein distance into the $L_q$ distance on a Euclidean space, and there exist lower bound results on the distortion incurred when embedding the Wasserstein distance into $L_q$ spaces. In the following section, we review one such result on the lower bound of the metric distortion when embedding the Wasserstein distance into $L_1$ space, implying that if we choose $q = 1$, then in the worst case the Siamese network will incur at least as much distortion as that lower bound.
C Non-embeddability theorems for Wasserstein distances
Here we summarize results pertaining to the limitations of embedding the Wasserstein distance into the L_q distance on a Euclidean space. Consider probability distributions over the grid of points {0, 1, . . . , D}^2 ⊂ R^2, equipped with the 1-Wasserstein distance, i.e., the metric space (P({0, 1, . . . , D}^2), W_1). Let L_1 denote the space of Lebesgue measurable functions f : [0, 1] → R such that
\[
\|f\|_1 = \int_0^1 |f(t)| \, dt < \infty.
\]
Given a mapping F : (P({0, 1, . . . , D}^2), W_p) → L_1 such that for any µ, ρ ∈ P({0, 1, . . . , D}^2),
\[
W_p(\mu, \rho) \le \| F(\mu) - F(\rho) \|_1 \le L \cdot W_p(\mu, \rho), \tag{3}
\]
the distortion of F is the value L.
Theorem C.1 ((Naor and Schechtman, 2007)). Any embedding (P({0, 1, . . . , D}^2), W_1) → L_1 must incur distortion at least Ω(√(log D)).
For (P({0, 1, . . . , D}^d), W_1) with d ≥ 2, P({0, 1, . . . , D}^d) contains P({0, 1, . . . , D}^2), so the Ω(√(log D)) lower bound still applies to L_1 embeddings of P({0, 1, . . . , D}^d) for d > 2.
From the above results, we can see that for any Siamese architecture (even in the simple case of finite point sets in R^2 and R^3), we are unable to approximate Wasserstein distances to arbitrary accuracy via any L_1 embedding.
Corollary C.2. Given a neural network N_Siamese : X → R^d, where X is the set of weighted point sets over {0, 1, . . . , D}^2, the approximation ∥N_Siamese(µ) − N_Siamese(ν)∥_1 incurs distortion at least Ω(√(log D)).
Note that if we consider our input for a Siamese architecture to be finite point sets over {0, 1, . . . , D}^2, we allow multisets, so the input set size is not bounded by D.
D Experimental details
D.1 Baseline models for comparison with N_ProductNet
Here, we will detail two baseline neural networks in our experiments.
Wasserstein point cloud embedding (WPCE) network. First defined by Kawano et al. (2020), the WPCE network is a Siamese autoencoder architecture. It consists of an encoder network N_encoder and a decoder network N_decoder. WPCE takes as input two point sets P, Q ⊂ R^d. N_encoder is a permutation invariant neural network, which we chose to be DeepSets. In other words,
\[
\mathcal{N}_{\mathrm{encoder}}(P) = \phi_{\theta_2}\Big( \sum_{x \in P} h_{\theta_1}(x) \Big).
\]
Note that one may also choose PointNet as N_encoder. However, in our experiments, we did not see a large difference in approximation quality between PointNet and DeepSets (where the difference between the two amounts to using a sum vs. max aggregator). For the sake of consistency with N_ProductNet, we chose to use a sum aggregator for N_encoder (DeepSets). The decoder network N_decoder is a fully connected neural network which outputs a fixed-size point set in R^d. WPCE then uses a Sinkhorn reconstruction loss term to regularize the embedding produced by the encoder. Thus, given a set of paired input point sets and their Wasserstein distances {(P_i, Q_i, W_1(P_i, Q_i)) : i ∈ [N]}, the loss we optimize for WPCE is
\[
L(P, Q) = \frac{1}{N} \sum_{i} \Big( \| \mathcal{N}_{\mathrm{encoder}}(P_i) - \mathcal{N}_{\mathrm{encoder}}(Q_i) \|_2 - W_1(P_i, Q_i) \Big)^2
+ \lambda \Big( \frac{1}{N} \sum_{i} S_\epsilon\big( \mathcal{N}_{\mathrm{decoder}}(\mathcal{N}_{\mathrm{encoder}}(P_i)),\, P_i \big)
+ \frac{1}{N} \sum_{i} S_\epsilon\big( \mathcal{N}_{\mathrm{decoder}}(\mathcal{N}_{\mathrm{encoder}}(Q_i)),\, Q_i \big) \Big) \tag{4}
\]
Table 3: Size of the output point set from the WPCE decoder for each dataset.

Dataset   noisy-sphere-2   noisy-sphere-6   uniform   ModelNet-small   ModelNet-large   RNAseq
Size      200              200              256       100              2048             100
Table 4: Summary of details for each dataset. Note that min and max refer to the minimum and maximum number of points per input point set.

Dataset             dim   min    max    training pairs   val pairs
noisy-spheres-2       3   100    300    2000             200
noisy-spheres-6       6   100    300    3600             400
uniform               2   256    256    3000             300
ModelNet-small        3    20    200    3000             300
ModelNet-large        3  2048   2048    4000             400
RNAseq             2000    20    200    3000             300
where λ ∈ R is a constant that controls the balance between the two loss terms and S_ϵ denotes the Sinkhorn divergence, with ϵ referring to its regularization parameter. For our experiments, we chose λ = 0.1 and ϵ = 0.1. Furthermore, for each dataset used in our experiments, we used a different size for the fixed-size output point set of the decoder; this parameter is summarized per dataset in Table 3.
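As a concrete reference, the following is a hedged sketch of this objective for a single training pair, not the authors' code: it assumes `encoder` is a DeepSets-style set encoder, `decoder` an MLP producing a fixed-size point set, and it uses GeomLoss's SamplesLoss as the Sinkhorn divergence S_ϵ (its `blur` parameter plays the role of the entropic regularization scale).

```python
# Hedged sketch of the WPCE objective in Eq. (4) for one pair (P, Q).
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=1, blur=0.1)  # stands in for S_eps, eps ~ 0.1
lam = 0.1                                               # lambda in Eq. (4)

def wpce_loss(encoder, decoder, P, Q, w1_true):
    # P: (n, d), Q: (m, d) point sets; w1_true: precomputed W_1(P, Q)
    zP, zQ = encoder(P), encoder(Q)
    dist_term = (torch.norm(zP - zQ, p=2) - w1_true) ** 2      # embedding-vs-W_1 term
    recon_term = sinkhorn(decoder(zP), P) + sinkhorn(decoder(zQ), Q)  # reconstruction term
    return dist_term + lam * recon_term
```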
Siamese DeepSets. The baseline Siamese DeepSets model, N_SDeepSets(·, ·), consists of a single DeepSets model which maps each input point set to some Euclidean space R^d. The ℓ_2-norm between the two final embeddings is then used as the final estimate of the Wasserstein distance. Formally, let
\[
\mathcal{N}_{\mathrm{DeepSets}}(P) = \phi_{\theta_2}\Big( \sum_{x \in P} h_{\theta_1}(x) \Big),
\]
where ϕ_θ2 and h_θ1 are both MLPs, be the DeepSets model. Then, given two point sets P and Q, the final approximation of the Wasserstein distance given by N_SDeepSets(P, Q) is
\[
\mathcal{N}_{\mathrm{SDeepSets}}(P, Q) = \| \mathcal{N}_{\mathrm{DeepSets}}(P) - \mathcal{N}_{\mathrm{DeepSets}}(Q) \|_2 .
\]
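A minimal PyTorch-style sketch of this baseline follows; this is our own illustrative code, and the layer widths are hypothetical rather than the tuned hyperparameters.

```python
import torch
import torch.nn as nn

class SiameseDeepSetsSketch(nn.Module):
    # N_SDeepSets(P, Q) = || DeepSets(P) - DeepSets(Q) ||_2
    def __init__(self, d, hidden=128, embed=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d, hidden), nn.LeakyReLU(), nn.Linear(hidden, embed))
        self.phi = nn.Sequential(nn.Linear(embed, hidden), nn.LeakyReLU(), nn.Linear(hidden, embed))

    def deepsets(self, x):                      # x: (n, d) point set
        return self.phi(self.h(x).sum(dim=0))   # sum-pool per-point features, then phi_theta2

    def forward(self, P, Q):
        return torch.norm(self.deepsets(P) - self.deepsets(Q), p=2)
```

Unlike N_ProductNet, the outer combination here is fixed to an ℓ_2 distance between embeddings, which is exactly where the embedding lower bounds of Appendix C become relevant.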
D.2 Training and implementation details
Datasets. We used several synthetic datasets as well as the ModelNet40 point cloud dataset. We used two different types of synthetic datasets. We construct the 'noisy-sphere-d' dataset by sampling pairs of point clouds from four d-dimensional spheres centered at the origin with increasing radii of 0.25, 0.5, 0.75, and 1.0. For our experiments, we used 'noisy-sphere-3' and 'noisy-sphere-6'. Finally, the 'uniform' dataset of point sets in R^2 is constructed by sampling point sets from the uniform distribution on [−4, 4] × [−4, 4]. The full details and names of each dataset are summarized in Table 4.
Implementation. We implement all neural network architectures using the PyTorch (Paszke et al., 2019) and GeomLoss (Feydy et al., 2019) libraries. The ground truth 1-Wasserstein distances and Sinkhorn approximations were computed using Python Optimal Transport (POT) (Flamary et al., 2021). Note that for large point sets and for higher-dimensional datasets, there is often a high degree of numerical instability in the POT implementation of the Sinkhorn algorithm. In these cases (ModelNet-large and RNAseq), we used our own implementation of the Sinkhorn algorithm. For each model, we used Leaky-ReLU as the non-linearity. To train each model, we set the batch size for each dataset to 128 and the learning rate to 0.001. All models were trained on an Nvidia RTX A6000 GPU.
For both N_ProductNet and N_SDeepSets, given two input point sets, we minimize the mean squared error between the approximation produced by the network and the true Wasserstein distance. In other words, given two point sets P, Q, the loss for N_ProductNet is defined as L_{N_ProductNet}(P, Q) = (N_ProductNet(P, Q) − W_1(P, Q))^2, and the loss for Siamese DeepSets is L_{N_SDeepSets}(P, Q) = (N_SDeepSets(P, Q) − W_1(P, Q))^2. For WPCE, we train the network using the loss function defined in Eq. (4). Note that for N_ProductNet, the hyperparameters are the width, depth, and output dimension of the MLPs which represent h_θ1 and ϕ_θ2, and the width and depth of the MLP which represents ρ_θ3. For WPCE, we set the decoder to a three-layer neural network of width 100 and adjusted the width, depth, and output dimension of the MLPs which represent ϕ_θ2 and h_θ1 in N_encoder. To find the best model for each architecture, we randomly sampled hyperparameter configurations and conducted a hyperparameter search over 85 models for N_ProductNet and 75 models for both WPCE and N_SDeepSets.
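The regression objective above can be summarized in a short training-step sketch. This is our own illustrative code (the `ProductNetSketch` module and its input signature are the hypothetical ones from the earlier sketch), with ground-truth W_1 computed via POT's exact solver (`ot.dist` / `ot.emd2`) as described above.

```python
# Hedged sketch of one training step for the MSE objective on a single pair.
import numpy as np
import ot
import torch

def ground_truth_w1(X, Y):
    # X: (n, d), Y: (m, d) numpy arrays with uniform weights
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    M = ot.dist(X, Y, metric="euclidean")  # pairwise ground costs
    return ot.emd2(a, b, M)                # exact 1-Wasserstein cost

def train_step(model, optimizer, P_np, Q_np):
    target = torch.tensor(ground_truth_w1(P_np, Q_np), dtype=torch.float32)
    P = torch.tensor(P_np, dtype=torch.float32)
    Q = torch.tensor(Q_np, dtype=torch.float32)
    wP = torch.full((len(P),), 1.0 / len(P))
    wQ = torch.full((len(Q),), 1.0 / len(Q))
    pred = model((P, wP), (Q, wQ))         # hypothetical ProductNetSketch signature
    loss = (pred - target) ** 2            # squared error for this pair
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```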
D.3 Approximating 2-Wasserstein distance
To further demonstrate the use of our model, we additionally approximate the 2-Wasserstein distance; see Table 7 for results. The experimental set-up is the same as for the 1-Wasserstein distance, and we largely see the same trends; that is, N_ProductNet outperforms all other neural network implementations. Note that Table 7 shows that the Sinkhorn approximation with ϵ = 0.01 is more accurate than N_ProductNet. However, as the ϵ parameter of the Sinkhorn approximation decreases, the computation time increases. In Table 8, we show that the Sinkhorn approximation is already much slower than N_ProductNet at ϵ = 0.1 while also being less accurate. The Sinkhorn approximation with ϵ = 0.01 (reported in Table 7) is slower than the Sinkhorn approximation with ϵ = 0.1 and, additionally, is much slower than N_ProductNet at inference time.
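To illustrate the accuracy/time trade-off referred to here, the following is a small self-contained sketch using POT's entropic solver `ot.sinkhorn2` on random data. It is not our benchmarking code (in particular, the custom Sinkhorn implementation used for the larger datasets differs), and the point sets and sizes are arbitrary.

```python
# Illustrative sketch of the Sinkhorn accuracy/time tradeoff for 2-Wasserstein.
import time
import numpy as np
import ot

def sinkhorn_w2(X, Y, eps, max_iter=1000):
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    M = ot.dist(X, Y)                      # squared Euclidean ground costs for W_2
    cost = ot.sinkhorn2(a, b, M, reg=eps, numItermax=max_iter)
    return np.sqrt(cost)

X, Y = np.random.rand(300, 3), np.random.rand(300, 3)
for eps in (0.1, 0.01):
    t0 = time.time()
    w2 = sinkhorn_w2(X, Y, eps)
    print(f"eps={eps}: W_2 ~ {w2:.4f}, time {time.time() - t0:.3f}s")
```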
D.4 Generalization to large point sets
In addition to the results reported in Tables 1 and 7, which record the average relative error for point sets of sizes unseen during training, we have also included several plots in Figures 3 and 2 which demonstrate how the error of each approximation method changes as the input point set size increases, for both the 1-Wasserstein and 2-Wasserstein distances. Observe that for ModelNet-small, noisy-sphere-3, and noisy-sphere-6, the error for N_ProductNet increases significantly more slowly than that of WPCE. This trend is less evident for the uniform and ModelNet-large datasets. It is worth mentioning that we trained the models on fixed-size input for both of these datasets. It is possible that training with fixed-size input leads to a rapid deterioration in approximation quality for WPCE and
Table 5: Time for 500 1-Wasserstein distance computations in seconds. Note that we chose the input point set size to be the maximum point set size that we trained on. Additionally, the Sinkhorn distance reported uses ϵ = 0.1 as the regularization parameter. Note that as ϵ decreases, the error incurred by the Sinkhorn approximation will decrease but the computation time will also increase. Here, at ϵ = 0.1, the Sinkhorn approximation is already much slower than the neural network approximations while being less accurate.

Dataset          Input size   N_ProductNet   WPCE    N_SDeepSets   Sinkhorn   Ground truth
noisy-sphere-3   300          1.050          0.676   0.4904        2.591      2.813
noisy-sphere-6   300          0.752          0.491   0.503         1.986      6.6770
uniform          256          0.155          0.184   0.137         15.113     1.018
ModelNet-small   200          0.330          0.330   0.191         2.074      1.615
ModelNet-large   2048         1.174          1.571   0.612         239.448    254.947
RNAseq           200          1.128          0.856   0.792         92.153     105.908
N_SDeepSets when dealing with point sets of sizes not seen at training time. Furthermore, consider that WPCE may be especially sensitive to differences in input sizes at testing time, as training WPCE depends on minimizing the Wasserstein distance between the input point set and a fixed-size decoded point set, which may cause the model to be overly specialized to point sets of a fixed input size. This could explain the observed plots in both the ModelNet-large and uniform cases. Finally, as predicted by our theoretical analysis, the performance of the model degrades for higher-dimensional datasets, i.e., the RNAseq dataset.
E Extra proofs
E.1 Proof of DeepSets Universality.
Here we will provide extra details on the proof of Theorem 2.1 using multisymmetric polynomials. Note that multisymmetric polynomials were previously used in (Segol and Lipman, 2019) and (Maron et al., 2019) to show universality for equivariant set networks and arbitrary G-invariant neural networks for any permutation group G.
Proof. To begin, we will first define multisymmetric polynomials and power sum multisymmetric polynomials. Let A[y_1, . . . , y_n] be the ring of polynomials in n variables with coefficients in a ring A.

Definition E.1 (Multisymmetric polynomials). The multisymmetric polynomials in the n families of k variables x_1, . . . , x_n, where x_i = (x_{i,1}, . . . , x_{i,k}), are those polynomials that remain unchanged under every permutation of the n families x_1, . . . , x_n.

Let A be a ring. The algebra of multisymmetric polynomials in n families of k variables with coefficients in A is denoted J^k_n(A).
(a) ModelNet-small  (b) noisy-sphere-3  (c) noisy-sphere-6  (d) uniform  (e) ModelNet-large  (f) RNAseq
Figure 2: Average error of the 1-Wasserstein approximation for each model as the maximum number of points increases. Note that the Sinkhorn approximation uses ϵ = 0.1. These plots include point set sizes not seen at training time, to show how each approximation performs on unseen examples. Note that, especially for ModelNet-small, noisy-sphere-3, and noisy-sphere-6, the error for N_ProductNet increases at a slower rate than WPCE.
Table 6: Comparison of the mean relative error versus overall computation time for 500 approximations of 1-Wasserstein distance for N_ProductNet and the Sinkhorn distance. Note that the input point set size is the same as in Table 3 for each dataset. The parameter ϵ controls the accuracy of the Sinkhorn approximation, with lower ϵ corresponding to a more accurate approximation once the Sinkhorn algorithm converges. However, notice that in some cases the Sinkhorn algorithm with ϵ = 0.01 has a higher relative error than the Sinkhorn algorithm with ϵ = 0.1, as the algorithm fails to converge within a reasonable number of iterations (1000).

Dataset                      N_ProductNet    Sinkhorn (ϵ = 0.10)   Sinkhorn (ϵ = 0.01)
ModelNet-small   Error       0.084 ± 0.077   0.187 ± 0.232         0.011 ± 0.003
                 Time (s)    0.330           2.074                 104.712
ModelNet-large   Error       0.140 ± 0.206   0.148 ± 0.048         0.026 ± 0.008
                 Time (s)    1.174           239.448               1930.852
Uniform          Error       0.097 ± 0.073   0.073 ± 0.009         0.023 ± 0.098
                 Time (s)    0.155           15.113                63.028
noisy-sphere-3   Error       0.046 ± 0.043   0.187 ± 0.232         0.162 ± 0.132
                 Time (s)    1.050           2.591                 214.185
noisy-sphere-6   Error       0.015 ± 0.014   0.137 ± 0.122         0.326 ± 0.135
                 Time (s)    0.752           1.986                 101.763
RNAseq           Error       0.012 ± 0.010   0.040 ± 0.009         0.035 ± 0.013
                 Time (s)    1.128           92.153                91.573
(a) ModelNet-small  (b) noisy-sphere-3  (c) noisy-sphere-6  (d) uniform  (e) ModelNet-large  (f) RNAseq
Figure 3: Average 2-Wasserstein error for each model as the maximum number of points increases. Note that for all datasets, the Sinkhorn error is with ϵ = 0.10.
Table 7: Mean relative error between approximations and the 2-Wasserstein distance between point sets. The top row for each dataset shows the error for point sets with input sizes that were seen at training time, while the bottom row shows the error for point sets with input sizes that were not seen at training time. Note that the Sinkhorn approximation is computed with the regularization parameter set to ϵ = 0.01.

Dataset          Input size     N_ProductNet    WPCE            N_SDeepSets       Sinkhorn
noisy-sphere-3   [100, 300]     0.054 ± 0.071   0.291 ± 0.201   0.400 ± 0.336     0.078 ± 0.186
                 [400, 600]     0.188 ± 0.197   0.387 ± 0.386   0.427 ± 0.375     0.161 ± 0.311
noisy-sphere-6   [100, 300]     0.024 ± 0.010   0.331 ± 0.237   0.358 ± 0.231     0.019 ± 0.057
                 [400, 600]     0.092 ± 0.074   0.434 ± 0.598   0.623 ± 0.596     0.050 ± 0.039
uniform          256            0.112 ± 0.082   0.221 ± 0.162   0.241 ± 0.171     0.182 ± 0.044
                 [200, 300]     0.175 ± 0.123   2.431 ± 2.162   4.058 ± 3.324     0.055 ± 0.053
ModelNet-small   [20, 200]      0.078 ± 0.095   0.178 ± 0.148   0.183 ± 0.148     0.023 ± 0.059
                 [400, 600]     0.163 ± 0.151   0.216 ± 0.252   0.227 ± 0.179     0.034 ± 0.031
ModelNet-large   2048           0.187 ± 0.335   0.281 ± 0.203   0.538 ± 0.298     0.172 ± 0.065
                 [1800, 2000]   0.185 ± 0.302   0.523 ± 0.526   33.086 ± 28.481   0.046 ± 0.039
RNAseq           [20, 200]      0.049 ± 0.029   0.508 ± 0.291   0.490 ± 0.271     0.024 ± 0.009
                 [400, 600]     0.281 ± 0.057   0.533 ± 0.300   0.568 ± 0.317     0.987 ± 0.0002
Furthermore, we define the multisymmetric power-sum polynomials:

Definition E.2 (Multisymmetric power-sum polynomials). Let α = (α_1, . . . , α_k) ∈ N^k. Given x = (x_1, . . . , x_k), let
\[
x^\alpha = x_1^{\alpha_1} \cdots x_k^{\alpha_k}.
\]
The multisymmetric power sum with multi-degree α is
\[
p_\alpha = \sum_{i=1}^{n} x_i^\alpha .
\]
Among them, we will consider the set of elementary multisymmetric power sums to be those with |α| ≤ n.

Notice that there are $t = \binom{n+k}{k}$ elementary multisymmetric power sums. Let α_1, . . . , α_t be the list of all α ∈ N^k such that |α| ≤ n.

Theorem E.3 ((Briand, 2004)). Let A be a ring in which n! is invertible. Then the elementary multisymmetric power sums generate J^k_n(A) as an A-algebra.
If we take A = R, we get that the multisymmetric power sum polynomials generate J^k_N. Now, given a continuous function f : R^{k×N} → R which is invariant to permutations of the columns, we know that f can be approximated by a polynomial p which is invariant to permutations of the columns (see (Maron et al., 2019) for a detailed argument). Such a polynomial p is a multisymmetric polynomial in N families of k variables with coefficients in R, i.e., p ∈ J^k_N. Given x ∈ R^k, define
\[
\phi(x) = [x^{\alpha_1}, \ldots, x^{\alpha_t}].
\]
Table 8: Comparison of the mean relative error versus overall computation time for 300 approximations of 2-Wasserstein distance for N_ProductNet and the Sinkhorn distance. The parameter ϵ controls the accuracy of the Sinkhorn approximation, with lower ϵ corresponding to a more accurate approximation.

Dataset                      N_ProductNet    Sinkhorn (ϵ = 0.10)   Sinkhorn (ϵ = 0.01)
ModelNet-small   Error       0.078 ± 0.097   0.232 ± 0.132         0.019 ± 0.057
                 Time (s)    0.208           1.165                 9.857
ModelNet-large   Error       0.187 ± 0.335   0.363 ± 0.255         0.172 ± 0.089
                 Time (s)    2.841           6.079                 36.265
uniform          Error       0.112 ± 0.082   0.0303 ± 0.022        0.182 ± 0.044
                 Time (s)    0.712           16.515                29.312
noisy-sphere-3   Error       0.054 ± 0.071   0.225 ± 0.093         0.078 ± 0.186
                 Time (s)    0.677           1.591                 12.760
noisy-sphere-6   Error       0.024 ± 0.010   0.324 ± 0.316         0.023 ± 0.059
                 Time (s)    0.428           0.877                 9.831
RNAseq           Error       0.049 ± 0.029   0.031 ± 0.014         0.024 ± 0.009
                 Time (s)    0.716           47.701                81.016
Then
\[
\sum_{i=1}^{N} \phi(x_i) =
\begin{pmatrix} \sum_{i=1}^{N} x_i^{\alpha_1} \\ \sum_{i=1}^{N} x_i^{\alpha_2} \\ \vdots \\ \sum_{i=1}^{N} x_i^{\alpha_t} \end{pmatrix}
= \begin{pmatrix} p_{\alpha_1} \\ p_{\alpha_2} \\ \vdots \\ p_{\alpha_t} \end{pmatrix}.
\]
By Theorem E.3, the power sums p_{α_1}, . . . , p_{α_t} generate any polynomial in J^k_N. Then there is some polynomial q ∈ R[y_1, . . . , y_t] such that p = q(p_{α_1}, . . . , p_{α_t}). Setting ρ = q, we obtain p(x_1, . . . , x_N) = ρ(∑_{i=1}^N ϕ(x_i)), which approximates f, as desired.
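As a concrete illustration of this construction, the following short sketch (our own, with toy sizes n and k) enumerates all multi-indices α with |α| ≤ n, forms ϕ(x) = [x^{α_1}, . . . , x^{α_t}], and checks that there are exactly t = C(n+k, k) elementary multisymmetric power sums.

```python
# Illustrative sketch of the embedding phi and the power sums p_alpha = sum_i x_i^alpha.
from itertools import product
from math import comb
import numpy as np

def multi_indices(k, n):
    # all alpha in N^k with alpha_1 + ... + alpha_k <= n
    return [a for a in product(range(n + 1), repeat=k) if sum(a) <= n]

def phi(x, alphas):
    # x: a single point in R^k; returns [x^alpha_1, ..., x^alpha_t]
    return np.array([np.prod(x ** np.array(a)) for a in alphas])

n, k = 3, 2                                 # n points in R^k (toy sizes)
alphas = multi_indices(k, n)
assert len(alphas) == comb(n + k, k)        # t = C(n+k, k) power sums

X = np.random.rand(n, k)
power_sums = sum(phi(x, alphas) for x in X) # equals (p_alpha_1, ..., p_alpha_t)
```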
E.2 Proofs from Section 3
E.2.1 Proof of Lemma 3.2
Proof. Let ϵ > 0. Since f is uniformly continuous, there is a δ > 0 such that for all (A_1, . . . , A_m), (A'_1, . . . , A'_m) ∈ X_1 × ··· × X_m with d_{X_1×···×X_m}((A_1, . . . , A_m), (A'_1, . . . , A'_m)) < δ, we have |f(A_1, . . . , A_m) − f(A'_1, . . . , A'_m)| < ϵ. Since for any δ > 0 and any i ∈ [m], (X_i, d_{X_i}) has a (δ, a, G)-sketch, we know that there is an a_i ∈ N_+ for which there are continuous h_i : X_i → R^{a_i} and g_i : R^{a_i} → X_i with d_{X_i}(g_i ∘ h_i(A), A) < δ/m for each A ∈ X_i.

Let g′ : R^{a_1} × ··· × R^{a_m} → X_1 × ··· × X_m be defined as (u_1, . . . , u_m) ↦ (g_1(u_1), . . . , g_m(u_m)). Since d_{X_i}(g_i ∘ h_i(A_i), A_i) < δ/m for each A_i ∈ X_i and i ∈ [m],
\[
d_{X_1 \times \cdots \times X_m}\big( (g_1 \circ h_1(A_1), \ldots, g_m \circ h_m(A_m)), (A_1, \ldots, A_m) \big) < \delta.
\]
Let ρ = f ∘ g′ and ϕ_i = h_i. Then
\[
|f(A_1, \ldots, A_m) - \rho(\phi_1(A_1), \ldots, \phi_m(A_m))| = |f(A_1, \ldots, A_m) - f \circ g'(h_1(A_1), \ldots, h_m(A_m))|
= |f(A_1, \ldots, A_m) - f(g_1 \circ h_1(A_1), \ldots, g_m \circ h_m(A_m))| < \epsilon.
\]
Note that if X_1 = X_2 = ··· = X_m and G_1 = G_2 = ··· = G_m, we can use the same encoding and decoding functions, h_i and g_i respectively, for all X_i. Thus, in this case, ϕ_1 = ϕ_2 = ··· = ϕ_m.
E.2.2 Proof of Lemma 3.3
Proof. Using the same argument as in Lemma 3.2, we know that for ϵ/2 there are continuous h : X → R^a and g : R^a → X such that
\[
|f(A_1, \ldots, A_m) - f(g \circ h(A_1), \ldots, g \circ h(A_m))| < \frac{\epsilon}{2}.
\]
As before, let g′ : R^{a×m} → X × ··· × X be g′(u_1, . . . , u_m) = (g(u_1), . . . , g(u_m)). Take F : R^{a×m} → R as F(u_1, . . . , u_m) = f ∘ g′(u_1, . . . , u_m) = f(g(u_1), . . . , g(u_m)), where u_i represents the i-th column of an element of R^{a×m}. Note that F is continuous and invariant to permutations of the columns. Let $t = \binom{a+m}{m}$. Therefore, by Theorem 2.1, there are γ : R^a → R^t and ρ such that
\[
\Big| f \circ g'(v_1, \ldots, v_m) - \rho\Big( \sum_{i=1}^{m} \gamma(v_i) \Big) \Big| < \frac{\epsilon}{2}.
\]
Now set ϕ = γ ∘ h, which is a function from X to R^t. By the triangle inequality, we thus have
\[
\Big| f(A_1, \ldots, A_m) - \rho\Big( \sum_{i=1}^{m} \phi(A_i) \Big) \Big|
\le \Big| f(A_1, \ldots, A_m) - f \circ g'(h(A_1), \ldots, h(A_m)) \Big|
+ \Big| f \circ g'(h(A_1), \ldots, h(A_m)) - \rho\Big( \sum_{i=1}^{m} \phi(A_i) \Big) \Big|
< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\]
This completes the proof.