Neural approximation of Wasserstein distance via a universal architecture for symmetric and factorwise group invariant functions

Samantha Chen¹ and Yusu Wang¹,²

¹Department of Computer Science and Engineering, University of California, San Diego
²Halıcıoğlu Data Science Institute, University of California, San Diego

November 20, 2023
Abstract
Learning distance functions between complex objects, such as the Wasserstein dis-
tance to compare point sets, is a common goal in machine learning applications. How-
ever, functions on such complex objects (e.g., point sets and graphs) are often required
to be invariant to a wide variety of group actions e.g. permutation or rigid trans-
formation. Therefore, continuous and symmetric product functions (such as distance
functions) on such complex objects must also be invariant to the product of such group
actions. We call these functions symmetric and factor-wise group invariant functions
(or SFGI functions in short). In this paper, we first present a general neural network
architecture for approximating SFGI functions. The main contribution of this paper
combines this general neural network with a sketching idea to develop a specific and
efficient neural network which can approximate the p-th Wasserstein distance between
point sets. Very importantly, the required model complexity is independent of the sizes
of input point sets. On the theoretical front, to the best of our knowledge, this is the
first result showing that there exists a neural network with the capacity to approximate
Wasserstein distance with bounded model complexity. Our work provides an interest-
ing integration of sketching ideas for geometric problems with universal approximation
of symmetric functions. On the empirical front, we present a range of results showing
that our newly proposed neural network architecture performs comparably to or better
than other models (including a SOTA Siamese autoencoder based approach). In par-
ticular, our neural network generalizes significantly better and trains much faster than
the SOTA Siamese AE. Finally, this line of investigation could be useful in exploring
effective neural network design for solving a broad range of geometric optimization
problems (e.g., k-means in a metric space).
1 Introduction
Recently, significant interest in geometric deep learning has led to a focus on neural network
architectures which learn functions on complex objects such as point clouds (Zaheer et al., 2017; Qi et al., 2017) and graphs (Scarselli et al., 2008). Advancements in the development of neural networks for complex objects have led to progress in a variety of applications, from 3D image
segmentation (Uy et al., 2019) to drug discovery (Bongini et al., 2021; Gilmer et al., 2017).
One challenge in learning functions over such complex objects is that the desired functions
are often required to be invariant to certain group actions. For instance, functions on point
clouds are often permutation invariant with respect to the ordering of individual points.
Indeed, developing permutation invariant or equivariant neural network architectures, as
well as understanding their universal approximation properties, has attracted significant
attention in the past few years; see e.g., (Qi et al., 2017; Zaheer et al., 2017; Maron et al.,
2018, 2019; Segol and Lipman, 2019; Keriven and Peyré, 2019; Yarotsky, 2022; Wagstaff et al., 2019, 2022; Bueno and Hylton, 2021).
However, in many geometric or graph optimization problems, our input goes beyond a single complex object to multiple complex objects. For example, the p-Wasserstein distance $W_p(X, Y)$ between two point sets $X$ and $Y$ sampled from some metric space (e.g., $\mathbb{R}^d$) is a function over pairs of point sets. To give another example, the 1-median of a collection of $k$ point sets $P_1, \ldots, P_k$ in $\mathbb{R}^d$ can be viewed as a function over $k$ point sets. A natural way to model such functions is to use a product space. In particular, let $\mathcal{X}$ denote the space of finite point sets from a bounded region in $\mathbb{R}^d$. Then the p-Wasserstein distance can be viewed as a function $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Similarly, the 1-median for $k$ point sets can be modeled by a function from the product of $k$ copies of $\mathcal{X}$ to $\mathbb{R}$. Such functions are not only invariant to permutations of the factors of the product space (i.e., $W_p(X, Y) = W_p(Y, X)$) but are also invariant or equivariant with respect to certain group actions for each factor. For the p-Wasserstein distance, $W_p$ is invariant to permutations of the ordering of points within both $X$ and $Y$. This motivates us to extend the setting of learning continuous group invariant functions to learning continuous functions over product spaces which are both invariant to some product of group actions and symmetric. More precisely, we consider a type of function which we call an SFGI product function.
Definition 1.1 (SFGI product function). Given compact metric spaces $(\mathcal{X}_i, d_{\mathcal{X}_i})$ for $i \in [k]$, we define symmetric and factor-wise group invariant (SFGI) product functions as functions $f : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_k \to \mathbb{R}$ such that $f$ is (1) symmetric over the $k$ factors, and (2) invariant to the group action $G_1 \times G_2 \times \cdots \times G_k$ for some group $G_i$ acting on $\mathcal{X}_i$, for each $i \in [k]$.
Contributions. SFGI product functions can represent a wide array of geometric matching
problems including computing the Wasserstein or Hausdorff distance between point sets. In
this paper, we first provide a general framework for approximating SFGI product functions
in Section 3.1. Our primary contribution, described in Section 3.2, is the integration of
this general framework with a sketching idea in order to develop an efficient and specific
SFGI neural network which can approximate the p-Wasserstein distance between point sets
(sampled from a compact set in a nice metric space, such as the fixed-dimensional Euclidean
space). Most importantly, the complexity of our neural network (i.e., the number of parameters needed) is independent of the maximum size of the input point sets, and depends only on
the additive approximation error. To the best of our knowledge, this is the first architecture
which provably achieves this property. This is in contrast to many existing universal approx-
imation results on graphs or sets (e.g., for DeepSet) where the network sizes depend on the
size of each graph or point set in order to achieve universality (Maron et al., 2019; Wagstaff
et al., 2022; Bueno and Hylton, 2021). We also provide a range of experimental results in
Section 4 showing the utility of our neural network architecture for approximating Wasser-
stein distances. We compare our network with a SOTA Siamese autoencoder (Kawano et al., 2020), a natural Siamese DeepSets network, and the standard Sinkhorn approximation of Wasserstein distance. Our results show that our universal neural network architecture produces Wasserstein approximations which are better than those of the Siamese DeepSets network, comparable to those of the SOTA Siamese autoencoder, and generalize much better than both to input point sets of sizes which are unseen at training time. Furthermore, we show that our approximation (at inference time) is much faster than the standard Sinkhorn approximation of the p-Wasserstein distance at similar error thresholds. Moreover, our neural
network trains much faster than the SOTA Siamese autoencoder. Overall, our network is
able to achieve equally accurate or better Wasserstein approximations which generalize
better to point sets of unseen size as compared to SOTA while significantly reducing
training time. In Appendix B, we provide a discussion of issues with other natural choices of
neural network architectures one might use for estimating Wasserstein distances, including
one directly based on Siamese networks, which are often used for metric learning.
All missing proofs / details can be found in the Supplementary materials.
Related work. Efficient approximations of Wasserstein distance via neural networks are
an active area of research. One approach is to use input convex neural networks to approxi-
mate the 2-Wasserstein distance (Makkuva et al., 2020; Taghvaei and Jalali, 2019). However,
for this approach, training is done per pair of inputs and is restricted to the 2-Wasserstein dis-
tance which makes it unsuitable for a general neural network approximation of p-Wasserstein
distances between discrete distributions. This neural approximation method contrasts with
our goal: a general neural network that can approximate the p-Wasserstein distance be-
tween any two point sets in a compact metric space to within $\epsilon$-accuracy. Siamese networks are another popular approach for learning Wasserstein distances. Typically, a Siamese network is composed of a single neural network which maps two input instances to Euclidean space, and the output of the network is then given by the $\ell_p$-norm between the out-
put embeddings. In (Courty et al., 2017), the authors utilize a Siamese autoencoder which
takes two histograms (images) as input. For their architecture, a single encoder network is
utilized to map each histogram to an embedding in Euclidean space, while a decoder network
maps each Euclidean embedding back to an output histogram. The Kullback-Leibler (KL) divergence between the original histogram and the output histogram (i.e., a reconstruction loss) is used during training to regularize the embeddings, and the final estimate of the 2-Wasserstein distance is the $\ell_2$ norm between the embeddings. The idea of learning Wasserstein distances
via Siamese autoencoders was extended in (Kawano et al., 2020) to point cloud data with
the Wasserstein point cloud embedding network (WPCE) where the original KL reconstruc-
tion loss was replaced with a differentiable Wasserstein approximation between the original
point set and a fixed-size output point set from the decoder network. In our subsequent
experiments, we show that our neural network trains much more efficiently and generalizes
much better than WPCE to point sets of unseen size.
Moreover, the concept of group invariant networks was previously investigated in several
works, including (Zaheer et al., 2017; Qi et al., 2017; Maron et al., 2018; Lim et al., 2022;
Deng et al., 2021). For instance, DeepSets (Zaheer et al., 2017) and PointNet (Qi et al., 2017) are two popular permutation invariant neural networks which were shown to
be universal with respect to set functions. In addition to group invariance, there have also
been efforts to explore the notion of invariance with respect to combinations of groups, such
as invariance to both SE(3) and permutation group (Du et al., 2022; Maron et al., 2020)
or combining basis invariance with permutation invariance (Lim et al., 2022). Our work
differs from previous work in that we address a universal neural network which is invariant
with respect to a specific combination of permutation groups that corresponds to an SFGI
function on point sets. In general, we can view this as a subgroup of the permutation group: encoding the invariance of each individual point set together with the symmetry requirement of the product function corresponds to a specific subgroup of the permutation group. Thus,
previous results regarding permutation invariant architectures such as DeepSets, PointNet
or combinations of group actions (such as (Du et al., 2022)) do not address our setting of
SFGI functions or p-Wasserstein distances.
2 Preliminaries
We will begin with basic background on groups, universal approximation, and Wasserstein
distances.
Groups and group actions. A group $G$ is an algebraic structure consisting of a set of elements and a binary operation that satisfies a specific set of axioms: (1) associativity, (2) the existence of an identity element, and (3) the existence of an inverse for each element of the set. Given a metric space $(X, d_X)$, an action of the group $G$ on $X$ is a function $\alpha : G \times X \to X$ that transforms the elements of $X$ for each element $\pi \in G$. For each element $\pi \in G$, we write $\pi \cdot x$ to denote the action of $\pi$ on $x$ instead of $\alpha(\pi, x)$. For example, if $G$ is the permutation group over $[N] := \{1, 2, \ldots, N\}$ and $X = \mathbb{R}^N$, then for any $\pi \in G$, $\pi \cdot x$ represents the permutation of the coordinates of $x \in \mathbb{R}^N$ via $\pi$, i.e., given $x = (x_1, x_2, \ldots, x_N)$, $\pi \cdot x = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$. A function $f : X \to Y$ is $G$-invariant if for any $x \in X$ and any $\pi \in G$, we have $f(x) = f(\pi \cdot x)$.
Universal Approximation. Let $C(X, \mathbb{R})$ denote the set of continuous functions from a metric space $(X, d_X)$ to $\mathbb{R}$. Given two families of functions $\mathcal{F}_1$ and $\mathcal{F}_2$ where $\mathcal{F}_1 \subseteq \mathcal{F}_2$ and $\mathcal{F}_1, \mathcal{F}_2 \subseteq C(X, \mathbb{R})$, we say that $\mathcal{F}_1$ universally approximates $\mathcal{F}_2$ if for any $\epsilon > 0$ and any $f \in \mathcal{F}_2$, there is a $g \in \mathcal{F}_1$ such that $\|g - f\| < \epsilon$. Different norms on the space of functions can be used, but we will use the $L_\infty$ norm in this paper, which intuitively leads to additive pointwise error over the domain of these functions. The classic universal approximation theorem for multilayer perceptrons (MLPs) (Cybenko, 1989) states that a feedforward neural network with a single hidden layer, using certain activation functions, can approximate any continuous function to within an arbitrary additive $\epsilon$-error.
Permutation invariant neural networks for point cloud data. One of the most popular permutation invariant neural network models is the DeepSets model defined in (Zaheer et al., 2017). DeepSets is designed to handle unordered input point sets by first applying a neural network to each individual element, then using sum-pooling to generate an embedding for the input set, and finally applying a final neural network to the pooled embedding. Formally, suppose we are given a finite multiset $S = \{x : x \in \mathbb{R}^d\}$ (meaning that an element can appear multiple times, and the number of times an element occurs in $S$ is called its multiplicity). The DeepSets model is defined as
$$\mathcal{N}_{\mathrm{DeepSet}}(S) = g_{\theta_2}\Big(\sum_{x \in S} h_{\theta_1}(x)\Big)$$
where $h_{\theta_1}$ and $g_{\theta_2}$ are neural networks. DeepSets can handle input point sets of variable sizes. It was also shown to be universal with respect to continuous multiset functions.
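Below is a minimal PyTorch sketch of the DeepSets model just described; the layer widths, activations, and class name are illustrative assumptions rather than a configuration used in this paper.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """Minimal DeepSets: per-element MLP h, sum-pooling, then an outer MLP g."""
    def __init__(self, in_dim, hidden_dim=64, out_dim=1):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))
        self.g = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, out_dim))

    def forward(self, S):
        # S: (num_points, in_dim), an unordered multiset of points.
        pooled = self.h(S).sum(dim=0)   # permutation-invariant sum-pooling
        return self.g(pooled)

# The output is unchanged (up to float error) under any reordering of the rows of S.
S = torch.randn(100, 3)
model = DeepSets(in_dim=3)
out1 = model(S)
out2 = model(S[torch.randperm(100)])
assert torch.allclose(out1, out2, atol=1e-4)
```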
Theorem 2.1 ((Zaheer et al., 2017; Amir et al., 2023; Dym and Gortler, 2022)). Assume the elements are from a compact set in $\mathbb{R}^k$, and the input multiset size is fixed as $N$. Let $t = 2kN + 1$. Then any continuous multiset function, represented as $f : \mathbb{R}^{k \times N} \to \mathbb{R}$ which is invariant with respect to permutations of the columns, can be approximated arbitrarily closely in the form $\rho\big(\sum_{x \in X} \phi(x)\big)$, for continuous transformations $\phi : \mathbb{R}^k \to \mathbb{R}^t$ and $\rho : \mathbb{R}^t \to \mathbb{R}$.

While universality for the case $k = 1$ was shown using symmetric polynomials, the case $k > 1$ is in fact quite subtle and the proof in (Zaheer et al., 2017) misses key details. For completeness, we provide a full proof in Appendix E.1 for the case when the output dimension of $\phi$ is $t = \binom{k+N}{k}$. It was recently shown in (Dym and Gortler, 2022; Amir et al., 2023) that the output dimension of $\phi$ can be reduced to $2kN + 1$, which is the dimension $t$ that we use in Theorem 2.1 and subsequent theorems. In both cases, whether the output dimension of $\phi$ is $t = \binom{k+N}{k}$ or $t = 2kN + 1$, Theorem 2.1 implies that to achieve universality, the required network size depends on the input point cloud size.
Wasserstein distances and approximations. Here we introduce the Wasserstein distance for discrete measures. Let $(X, d_X)$ be a metric space. For two weighted point sets $P = \{(x_i, w_i) : x_i \in X, \sum_i w_i = 1, i \in [n]\}$ and $Q = \{(x'_j, w'_j) : x'_j \in X, \sum_j w'_j = 1, j \in [m]\}$, we define the $p$-Wasserstein distance between $P$ and $Q$ as
$$W_p(P, Q) = \min_{\Pi \in \mathbb{R}^{n \times m}_+} \Big\{ \langle \Pi, D^p \rangle^{1/p} : \Pi \mathbf{1} = [w_1, \ldots, w_n],\ \Pi^T \mathbf{1} = [w'_1, \ldots, w'_m] \Big\}$$
where $D \in \mathbb{R}^{n \times m}_+$ is the distance matrix with $D_{i,j} = d_X(x_i, x'_j)$. One can think of these weighted point sets as discrete probability distributions on $(X, d_X)$. When $p = 1$, $W_1$ is also commonly known as the Earth Mover's distance (EMD). Additionally, note that when $p = \infty$, $W_p$ is the same as the Hausdorff distance between the points in $P$ and $Q$ with non-zero
weight. Computing Wasserstein distances amounts to solving a linear programming problem, which takes $O(N^3 \log N)$ time (where $N = \max\{n, m\}$). There have been a number of methods for fast approximation of Wasserstein distances, including multi-scale and hierarchical solvers (Schmitzer, 2016) and $L_1$ embeddings via quadtree algorithms (Backurs et al., 2020; Indyk and Thaper, 2003). In particular, the entropic regularization of Wasserstein distance (Cuturi, 2013), also known as the Sinkhorn distance, is often used as the standard Wasserstein distance approximation for learning tasks. Unlike the Wasserstein distance, the Sinkhorn approximation is differentiable and can be computed in approximately $O(n^2)$ time. The computation time is governed by a regularization parameter $\epsilon$; as $\epsilon$ approaches zero, the Sinkhorn distance approaches the true Wasserstein distance.
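As a concrete illustration of the two quantities discussed above, the following sketch compares the exact 1-Wasserstein distance with its Sinkhorn approximation using the POT library (Flamary et al., 2021); the point-set sizes and the regularization value 0.1 are illustrative choices.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)

rng = np.random.default_rng(0)
X, Y = rng.random((100, 3)), rng.random((150, 3))    # two point sets in R^3
a = np.full(100, 1.0 / 100)                          # uniform weights summing to 1
b = np.full(150, 1.0 / 150)
D = ot.dist(X, Y, metric='euclidean')                # n x m distance matrix

w1_exact = float(ot.emd2(a, b, D))                   # exact W_1 via linear programming
w1_sinkhorn = float(ot.sinkhorn2(a, b, D, reg=0.1))  # entropic (Sinkhorn) approximation

print(f"exact W_1          = {w1_exact:.4f}")
print(f"Sinkhorn (eps=0.1) = {w1_sinkhorn:.4f}")
```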
3 Learning functions between point sets
We will first present a general framework for approximating SFGI-functions and then show
how this framework along with geometric sketches of our input data enables us to define neu-
ral networks which can approximate p-Wasserstein distances with complexity independent
of input data size.
3.1 A general framework for functions on product spaces
One of the key ingredients in our approach is the introduction of what we call a sketch of the input data into a Euclidean space whose dimension is independent of the size of the input data.

Definition 3.1 (Sketch). Let $\delta > 0$, $a \in \mathbb{N}_+$, and let $G$ be a group which acts on $X$. A $(\delta, a, G)$-sketch of a metric space $(X, d_X)$ consists of a $G$-invariant continuous encoding function $h : X \to \mathbb{R}^a$ and a continuous decoding function $g : \mathbb{R}^a \to X$ such that $d_X(g \circ h(S), S) < \delta$ for any $S \in X$.
Now let $(\mathcal{X}_1, d_{\mathcal{X}_1}), \ldots, (\mathcal{X}_m, d_{\mathcal{X}_m})$ be compact metric spaces. The product space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ is still a metric space when equipped with the following natural metric induced from the metrics of the factors:
$$d_{\mathcal{X}_1 \times \cdots \times \mathcal{X}_m}\big((A_1, \ldots, A_m), (A'_1, \ldots, A'_m)\big) = d_{\mathcal{X}_1}(A_1, A'_1) + \cdots + d_{\mathcal{X}_m}(A_m, A'_m).$$
Suppose $G_i$ is a group acting on $\mathcal{X}_i$, for each $i \in [m]$. In the following result, instead of SFGI product functions, we first consider the more general case of factor-wise group invariant functions, namely functions $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ such that $f$ is uniformly continuous and invariant to the group action $G_1 \times \cdots \times G_m$.
Lemma 3.2. Suppose $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ is uniformly continuous and invariant to $G_1 \times \cdots \times G_m$. Additionally, assume that for any $\delta > 0$, $(\mathcal{X}_i, d_{\mathcal{X}_i})$ has a $(\delta, a_i, G_i)$-sketch, where $a_i$ may depend on $\delta$. Then for any $\epsilon > 0$, there are continuous $G_i$-invariant functions $\phi_i : \mathcal{X}_i \to \mathbb{R}^{a_i}$ for all $i \in [m]$ and a continuous function $\rho : \mathbb{R}^{a_1} \times \cdots \times \mathbb{R}^{a_m} \to \mathbb{R}$ such that
$$|f(A_1, A_2, \ldots, A_m) - \rho(\phi_1(A_1), \phi_2(A_2), \ldots, \phi_m(A_m))| < \epsilon$$
for any $(A_1, \ldots, A_m) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$. Furthermore, if $\mathcal{X}_1 = \mathcal{X}_2 = \cdots = \mathcal{X}_m$, then we can choose $\phi_1 = \phi_2 = \cdots = \phi_m$.
Note that a recent result from (Lim et al., 2022) shows that a continuous factor-wise group invariant function $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_m \to \mathbb{R}$ can be represented (not just approximated) in the form $f(v_1, \ldots, v_m) = \rho(\phi_1(v_1), \ldots, \phi_m(v_m))$ if there exists a topological embedding from each quotient $\mathcal{X}_i / G_i$ to Euclidean space. The condition that each quotient $\mathcal{X}_i / G_i$ has a topological embedding into a fixed-dimensional Euclidean space is strong: a topological embedding requires injectivity, while in a sketch one may collapse input objects as long as, after decoding, we obtain an approximated object which is close to the input. Our result can be viewed as a relaxation of their result in that we only require our space to have an approximate fixed-dimensional embedding (i.e., our $(\delta, a, G)$-sketch).
We often consider the case where $\mathcal{X} = \mathcal{X}_1 = \cdots = \mathcal{X}_m$, i.e., $f : \mathcal{X} \times \cdots \times \mathcal{X} \to \mathbb{R}$, where $G$ is a group acting on the factor $\mathcal{X}$. Oftentimes, we require the function not only to be invariant to the actions of the group $G$ on each individual factor $\mathcal{X}$ but also to be symmetric with respect to the ordering of the inputs; by this, we mean $f(A_1, \ldots, A_m) = f(A_{\pi(1)}, \ldots, A_{\pi(m)})$ for any permutation $\pi$ on $[m]$. In other words, we now consider the SFGI product functions $f$ introduced in Definition 1.1. The extra symmetry requirement adds more constraints to the form of $f$. We show that the set of uniformly continuous SFGI product functions can be universally approximated by product functions of an even simpler form than in Lemma 3.2, as stated in the lemma below.
Lemma 3.3. Assume the same setup as Lemma 3.2 with $\mathcal{X} = \mathcal{X}_1 = \cdots = \mathcal{X}_m$ and $G = G_1 = \cdots = G_m$. Assume that $\mathcal{X}$ has a $(\delta, a, G)$-sketch. Additionally, suppose $f$ is symmetric; hence $f$ is an SFGI function. Let $t = 2am + 1$. Then for any $\epsilon > 0$, there is a continuous $G$-invariant function $\phi : \mathcal{X} \to \mathbb{R}^t$ and a continuous function $\rho : \mathbb{R}^t \to \mathbb{R}$ such that
$$\Big|f(A_1, \ldots, A_m) - \rho\Big(\sum_{i=1}^{m} \phi(A_i)\Big)\Big| < \epsilon.$$
Now suppose we want to approximate an SFGI product function $f$ with a neural network. Lemma 3.3 implies that we can approximate $\phi$ with any universal $G$-invariant neural network which embeds the original space $\mathcal{X}$ into some Euclidean space $\mathbb{R}^a$, while the outer architecture $\rho$ can be any universal architecture (e.g., an MLP). Finding a universal $G$-invariant neural network to realize $\phi$ over a single factor space $\mathcal{X}$ is in general much easier than finding an SFGI neural network, and as we discussed at the end of Section 1, we already know how to achieve this in several settings. We will show how this idea is put to work for approximating SFGI functions between point sets in the next subsection.
3.2 Universal neural networks for functions between point sets
We are interested in learning symmetric functions between point sets (i.e. any p-Wasserstein
distance) which are factor-wise permutation invariant. In this section, we will show that we
can find a (δ, a, G)-sketch for the space of weighted point sets. This allows us to combine
Lemma 3.3 with DeepSets to define a set of neural networks which can approximate p-th
Wasserstein distances to arbitrary accuracy. Furthermore, the encoding and decoding func-
tions can be approximated with neural networks where their model complexity is independent
of input point set size. Thus, the resulting neural network used to approximate Wasserstein
distance also has bounded model complexity.
Set up. Given some metric space $(\Omega, d_\Omega)$, let $\mathcal{X}$ be the set of weighted point sets with at most $N$ elements. In other words, each $S \in \mathcal{X}$ has the form $S = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in \Omega\}$ with $|S| \le N$. One can also consider $\mathcal{X}$ to be the set of weighted Dirac measures over $\Omega$. For simplicity, we sometimes also use $S$ to refer to just the set of points $\{x_i\}$ contained within it. We consider the metric over $\mathcal{X}$ to be the $p$-Wasserstein distance $W_p$, and we refer to the metric space of weighted point sets over $\Omega$ as $(\mathcal{X}, W_p)$.

First, we will show that for any $\delta > 0$, there is a $(\delta, a, G)$-sketch of $\mathcal{X}$ with respect to $W_p$. The embedding dimension $a$ depends on the so-called covering number of the metric space $(\Omega, d_\Omega)$ from which points are sampled. Given a compact metric space $(\Omega, d_\Omega)$, the covering number $\nu_\Omega(r)$ w.r.t. radius $r$ is the minimal number of radius-$r$ balls needed to cover $\Omega$. As a simple example, consider $\Omega = [-\Delta, \Delta] \subset \mathbb{R}$. Given any $r$, we can cover $\Omega$ with $\lceil \frac{2\Delta}{r} \rceil$ intervals, so $\nu_\Omega(r) \le \lceil \frac{2\Delta}{r} \rceil$. The collection of centers of a set of $r$-balls that cover $\Omega$ is called an $r$-net of $\Omega$. For a compact set $\Omega \subset \mathbb{R}^d$ with diameter $D$, its covering number $\nu_\Omega(r)$ is a constant depending on $D$, $r$ and $d$ only.
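For instance, a standard grid construction gives a concrete (non-tight) bound for a box in $\mathbb{R}^d$; the worked bound below is our own illustration, following the usual cube-diameter argument rather than anything specific to this paper.

```latex
% Covering number of \Omega = [-\Delta, \Delta]^d \subset \mathbb{R}^d (Euclidean metric).
% Tile \Omega with axis-aligned cubes of side s = 2r/\sqrt{d}; each cube has diameter
% s\sqrt{d} = 2r, so it fits inside a ball of radius r centered at the cube's center. Hence
\nu_\Omega(r) \;\le\; \left\lceil \tfrac{2\Delta}{s} \right\rceil^{d}
           \;=\; \left\lceil \tfrac{\Delta\sqrt{d}}{r} \right\rceil^{d},
% a bound independent of any point-set size N but exponential in the dimension d.
```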
Theorem 3.4. Set $d_{\mathcal{X}}$ to be $W_p$ for $1 \le p < \infty$. Let $G$ be the permutation group. For any $\delta > 0$, let $\delta_0 = \frac{1}{2}\sqrt[p]{\delta/2}$ and let $a = \nu_\Omega(\delta_0)$ be the covering number of $\Omega$ w.r.t. radius $\delta_0$. Then there is a $(\delta, a, G)$-sketch of $\mathcal{X}$ with respect to $W_p$. Furthermore, the encoding function $h : \mathcal{X} \to \mathbb{R}^a$ can be expressed as follows, where $\overline{h} : \Omega \to \mathbb{R}^a$ is continuous:
$$h(S) = \sum_{x \in S} \overline{h}(x). \qquad (1)$$
Proof. Let $\delta > 0$ and let $S \in \mathcal{X}$ with $S = \{(x_i, w_i) : \sum_i w_i = 1, x_i \in \Omega\}$ and $|S| \le N$. Given $\delta_0 = \frac{1}{2}\sqrt[p]{\delta/2}$ and $a = \nu_\Omega(\delta_0)$, we know that $\Omega$ has a $\delta_0$-net $C$, and we denote the elements of $C$ by $\{y_1, \ldots, y_a\}$. In other words, for any $x \in \Omega$, there is a $y_i \in C$ such that $d_\Omega(x, y_i) < \delta_0$.

First, we define the encoding function $h : \mathcal{X} \to \mathbb{R}^a$. For each $y_i$, we use a soft indicator function $e^{-b\, d_\Omega(x, B_{\delta_0}(y_i))}$ and set the constant $b$ so that $e^{-b\, d_\Omega(x, B_{\delta_0}(y_i))}$ is "sufficiently" small if $d_\Omega(x, B_{\delta_0}(y_i)) > \delta_0$. More formally, we know that $\lim_{b \to \infty} e^{-b\delta_0} = 0$, so there is $\beta \in \mathbb{R}$ such that for all $b > \beta$, $e^{-b\delta_0} < \frac{\delta_0^p}{d_{\max}^p \cdot a}$, where $d_{\max}$ denotes the diameter of $\Omega$. Set $b_0$ such that $b_0 > \beta$. Let $h_i(x) = e^{-b_0\, d_\Omega(x, B_{\delta_0}(y_i))}$ for each $i \in [a]$. For a given $x \in \Omega$, we compute $\overline{h} : \Omega \to \mathbb{R}^a$ as
$$\overline{h}(x) = [h_1(x), \ldots, h_a(x)].$$
Then we define the encoding function $h : \mathcal{X} \to \mathbb{R}^a$ as
$$h(S) = \sum_{i=1}^{n} w_i \frac{\overline{h}(x_i)}{\|\overline{h}(x_i)\|_1}.$$
Note that $\|h(S)\|_1 = 1$ and $h$ is continuous since Wasserstein distances metrize weak convergence. Additionally, since $d_\Omega(x, B_{\delta_0}(y_i))$ is the distance from $x$ to the $\delta_0$-ball around $y_i$, we are guaranteed to have at least one $j$ with $h_j(x_i) = 1$, so $\|\overline{h}(x_i)\|_1 \ge 1$.

Now, we define a decoding function $g : \mathbb{R}^a \to \mathcal{X}$ as $g(v) = \{(y_i, \frac{v_i}{\|v\|_1}) : i \in [a]\}$. In order to show that $g$ and $h$ yield a valid $(\delta, a, G)$-sketch of $\mathcal{X}$, we must show that $g \circ h(S)$ is sufficiently close to $S$ under the $W_p$ distance. First, we know that
$$W_p^p(g \circ h(S), S) \le \sum_{i=1}^{n} \sum_{j=1}^{a} w_i \frac{h_j(x_i)}{\|\overline{h}(x_i)\|_1}\, d_\Omega(x_i, y_j)^p.$$
For a given $x_i$, we can partition $\{h_1(x_i), \ldots, h_a(x_i)\}$ into those with $h_j(x_i) \ge \frac{\delta_0^p}{d_{\max}^p \cdot a}$ and those with $h_j(x_i) < \frac{\delta_0^p}{d_{\max}^p \cdot a}$, denoted $\{h_{j_1}(x_i), \ldots, h_{j_k}(x_i)\}$ and $\{h_{j_{k+1}}(x_i), \ldots, h_{j_a}(x_i)\}$ respectively. If $h_j(x) \ge \frac{\delta_0^p}{d_{\max}^p \cdot a}$, then
$$e^{-b_0\, d_\Omega(x, B_{\delta_0}(y_j))} \ge \frac{\delta_0^p}{d_{\max}^p \cdot a} > e^{-b_0 \delta_0},$$
so $d_\Omega(x, B_{\delta_0}(y_j)) < \delta_0$. Then
$$W_p^p(g \circ h(S), S) \le \sum_{i=1}^{n} \sum_{j=1}^{a} w_i \frac{h_j(x_i)}{\|\overline{h}(x_i)\|_1}\, d_\Omega(x_i, y_j)^p < \sum_{i=1}^{n} w_i \Bigg( \sum_{\ell=1}^{k} \frac{h_{j_\ell}(x_i)}{\|\overline{h}(x_i)\|_1}\, (2\delta_0)^p + \sum_{\ell=k+1}^{a} \frac{\delta_0^p}{d_{\max}^p \cdot a}\, d_{\max}^p \Bigg) \quad \text{(since $d_\Omega(x_i, y_j) \le d_{\max}$)}$$
$$\le \sum_{i=1}^{n} w_i \big( 2^p \delta_0^p + \delta_0^p \big) \le 2^p \cdot \frac{1}{2^p} \cdot \frac{\delta}{2} + \frac{1}{2^p} \cdot \frac{\delta}{2} < \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$
Thus, the encoding function $h$ and the decoding function $g$ make up a $(\delta, a, G)$-sketch.
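As a concrete illustration of this construction, the following NumPy sketch implements the soft one-hot encoding $h$ and decoding $g$ for point sets in the unit square, using a regular grid as the $\delta_0$-net; the grid resolution, the constant `b0`, and the function names are illustrative choices rather than values prescribed by the proof.

```python
import numpy as np

def make_net(delta0):
    """A regular grid serving as a delta_0-net of Omega = [0, 1]^2."""
    ticks = np.arange(delta0 / 2, 1.0, delta0)
    return np.array([[u, v] for u in ticks for v in ticks])       # shape (a, 2)

def encode(points, weights, net, delta0, b0=50.0):
    """h(S): weighted, row-normalized soft one-hot encoding in R^a."""
    dist_to_centers = np.linalg.norm(points[:, None, :] - net[None, :, :], axis=2)
    dist_to_balls = np.maximum(dist_to_centers - delta0, 0.0)     # d(x, B_delta0(y_i))
    H = np.exp(-b0 * dist_to_balls)                               # per-point soft indicators
    H /= H.sum(axis=1, keepdims=True)                             # divide by ||hbar(x_i)||_1
    return weights @ H                                            # sum_i w_i hbar(x_i)/||hbar(x_i)||_1

def decode(v, net):
    """g(v): a weighted point set supported on the net points."""
    return net, v / v.sum()

# Encode a random weighted point set and decode it back onto the net.
rng = np.random.default_rng(0)
pts, w = rng.random((200, 2)), np.full(200, 1.0 / 200)
net = make_net(delta0=0.1)
net_pts, net_w = decode(encode(pts, w, net, delta0=0.1), net)
```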
Note that the sketch outlined in Theorem 3.4 is a smooth version of a one-hot encoding. With Theorem 3.4 and Lemma 3.3, we will now give an explicit formulation of an $\epsilon$-approximation of $f$ via sum-pooling of continuous functions.
Corollary 3.5. Let $\epsilon > 0$, let $(\Omega, d_\Omega)$ be a compact metric space, and let $\mathcal{X}$ be the space of weighted point sets over $\Omega$ equipped with the $p$-Wasserstein distance $W_p$. Suppose for any $\delta$, $(\Omega, d_\Omega)$ has covering number $a(\delta)$. Then given a function $f : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is uniformly continuous and permutation invariant, there are continuous functions $\overline{h} : \Omega \to \mathbb{R}^{a(\delta)}$, $\phi : \mathbb{R}^{a(\delta)} \to \mathbb{R}^{a^*}$, and $\rho : \mathbb{R}^{a^*} \to \mathbb{R}$ such that for any $A, B \in \mathcal{X}$,
$$\Big| f(A, B) - \rho\Big( \phi\Big( \sum_{(x, w_x) \in A} w_x \overline{h}(x) \Big) + \phi\Big( \sum_{(x, w_x) \in B} w_x \overline{h}(x) \Big) \Big) \Big| < \epsilon$$
where $\overline{h}$, $\phi$ and $\rho$ are all continuous and $a^* = 4 \cdot a(\delta) + 1$.
Due to Eqn. (1), instead of considering the function $h$ which takes a set of points $S \in \mathcal{X}$ as input, we now only need to model the function $\overline{h} : \Omega \to \mathbb{R}^a$, which takes a single point $x \in S$ as input. For simplicity, assume that the input metric space $(\Omega, d_\Omega)$ is a compact set in some Euclidean space $\mathbb{R}^d$. Note that in contrast to Lemma 3.3, each of $\overline{h}$, $\phi$ and $\rho$ is simply a continuous function, and there is no further group invariance requirement. Furthermore, all the dimensions of the domains and ranges of these functions are bounded values that depend only on the covering number of $\Omega$ and the target additive error $\epsilon$, and are independent of the maximum size $N$ of the input point sets. We can use multilayer perceptrons (MLPs) $h_{\theta_1}$, $\phi_{\theta_2}$, and $\rho_{\theta_3}$ in place of $\overline{h}$, $\phi$ and $\rho$ to approximate the desired function. Formally, we define the following family of neural networks:
$$\mathcal{N}_{\mathrm{ProductNet}}(A, B) = \rho_{\theta_3}\Big( \phi_{\theta_2}\Big( \sum_{(x, w_x) \in A} w_x h_{\theta_1}(x) \Big) + \phi_{\theta_2}\Big( \sum_{(x, w_x) \in B} w_x h_{\theta_1}(x) \Big) \Big). \qquad (2)$$
In practice, we consider the input to the neural network $h_{\theta_1}$ to be a point $x$ along with its weight $w_x$. As per the discussion above, functions represented by $\mathcal{N}_{\mathrm{ProductNet}}$ can universally approximate SFGI product functions on the space of point sets. See Figure 1 for an illustration of our universal architecture for approximating product functions on point sets. As the $p$-Wasserstein distances $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ are uniformly continuous with respect to the underlying metric $W_p$, we can apply our framework to the problem of approximating the $p$-Wasserstein distance.
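The following is a minimal PyTorch sketch of the $\mathcal{N}_{\mathrm{ProductNet}}$ family in Eqn. (2); the layer widths and the choice to concatenate each point with its weight before $h_{\theta_1}$ are illustrative assumptions, not the exact configuration used in the experiments of Section 4.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class ProductNet(nn.Module):
    """N_ProductNet(A, B) = rho(phi(sum_A w_x h(x)) + phi(sum_B w_x h(x)))."""
    def __init__(self, point_dim, sketch_dim=64, latent_dim=128):
        super().__init__()
        self.h = mlp([point_dim + 1, 128, sketch_dim])   # per-point encoder on (point, weight)
        self.phi = mlp([sketch_dim, 128, latent_dim])    # inner map on the pooled sketch
        self.rho = mlp([latent_dim, 128, 1])             # outer map on the symmetric sum

    def embed(self, points, weights):
        # points: (n, d); weights: (n,) summing to 1.
        per_point = self.h(torch.cat([points, weights[:, None]], dim=1))
        return self.phi((weights[:, None] * per_point).sum(dim=0))

    def forward(self, A, B):
        (pa, wa), (pb, wb) = A, B
        return self.rho(self.embed(pa, wa) + self.embed(pb, wb))  # symmetric in (A, B)

# The prediction is invariant to permuting points within A or B and to swapping A and B.
A = (torch.rand(120, 3), torch.full((120,), 1 / 120))
B = (torch.rand(80, 3), torch.full((80,), 1 / 80))
model = ProductNet(point_dim=3)
print(model(A, B).item(), model(B, A).item())
```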
Importantly, the number of parameters in $\mathcal{N}_{\mathrm{ProductNet}}$ does not depend on the maximum size of the point sets but rather only on the additive error $\epsilon$ and, by extension, the covering number of the original metric space. This is because the encoding function for our sketch is defined as a summation of per-point terms in $\mathbb{R}^a$, where $a$ is independent of the size of the input set. Contrast this with the latent dimension in the universality statement of DeepSets (cf. Theorem 2.1), which depends on the input point set size. Note that in general, the model complexities of the MLPs $\rho_{\theta_3}$, $\phi_{\theta_2}$, and $h_{\theta_1}$ depend on the dimensions of the domain and co-domain of each function they approximate ($\rho$, $\phi$ and $\overline{h}$) and the desired approximation error $\epsilon$. We assume that the MLPs are Lipschitz continuous. In our case, $\phi_{\theta_2}$ operates on the sum of $h_{\theta_1}(x)$'s over all (up to $N$) input points in $A$ (or in $B$). In general, the errors made in the $\overline{h}$'s may accumulate $N$ times, which would force the precision we must achieve for each individual $h_{\theta_1}(x)$ (compared to $\overline{h}(x)$) to be $\Theta(\epsilon/N)$; this would cause the model complexity of $h_{\theta_1}$ to depend on $N$. Fortunately, this is not the case for our encoding function $h$. In particular, because our encoding function can be written as a normalized sum over individual points in the set $S$, i.e., $\phi_{\theta_2}$ operates on $\sum_{(x, w_x) \in S} w_x h_{\theta_1}(x)$ with $\sum_x w_x = 1$, the error accumulated by the normalized sum is at most the maximum error made on any single point $\overline{h}(x)$ for $x \in S$. Thus, as both the error required of each MLP and the dimensions of the domain and co-domain of each approximated function ($\rho$, $\phi$, $\overline{h}$) do not depend on the size $N$ of the input point sets, we get that $\rho_{\theta_3}$, $\phi_{\theta_2}$ and $h_{\theta_1}$ each have model complexity independent of the size of the input point sets. In short, because the encoding and decoding functions of our sketch can be approximated by neural networks with model complexity independent of the input point set size, we are able to obtain a low model complexity neural network which can approximate the Wasserstein distance arbitrarily well. Note that in general, we are not guaranteed to be able to find a sketch which can be approximated by neural networks whose size is independent of the input size. Finally, this neural network architecture can also be applied to approximating the Hausdorff distance; more details regarding Hausdorff distance approximation are available in Appendix A. We summarize the above results in the following corollary.
Corollary 3.6. There is a network of the form $\mathcal{N}_{\mathrm{ProductNet}}$ which can approximate the $p$-Wasserstein distance between point sets to within an additive $\epsilon$-error. The number of parameters in this network depends on $\epsilon$ and the covering number of $\Omega$, and not on the size of each point set.

Additionally, if we replace the sum-pooling with max-pooling in $\mathcal{N}_{\mathrm{ProductNet}}$, there is a network of such a form which can also approximate the Hausdorff distance between point sets to additive $\epsilon$ accuracy.
Figure 1: Visually representing a neural network which can universally approximate uniformly continuous SFGI product functions over pairs of point sets. (The diagram depicts a Siamese network applied to set elements, Euclidean embeddings obtained by sum-pooling over set elements, a Siamese network applied to the embeddings, and a final sum-pooling followed by an MLP.)
Exponential dependence on dimension. Although the model complexity of $\mathcal{N}_{\mathrm{ProductNet}}$ is independent of the size of the input point sets, it depends on the covering number of $\Omega$, which, in turn, can have an exponential dependence on the dimension of $\Omega$. In short, this means that the model complexity of $\mathcal{N}_{\mathrm{ProductNet}}$ has an exponential dependence on the dimension of $\Omega$. However, in practice, many machine learning tasks (e.g., 3D point cloud processing) involve large point sets sampled from a low-dimensional space ($d = 3$). Furthermore, in general, the covering number of $\Omega$ depends on the intrinsic dimension of $\Omega$ rather than the ambient dimension. For instance, if the input point sets are sampled from a hidden manifold of dimension $d'$ (which is much lower than the ambient dimension $d$), then the covering number would depend only on $d'$ and the curvature bound of the manifold. In many modern machine learning applications, it is often assumed that the data is sampled from a hidden space of low dimension (the manifold hypothesis), even though the ambient dimension might be very high.
Using max-pooling instead of sum-pooling. Observe that in the final step of combining the Euclidean outputs for two point sets $A, B \in \mathcal{X}$, namely $\sum_{(x, w_x) \in A} w_x h_{\theta_1}(x)$ and $\sum_{(x, w_x) \in B} w_x h_{\theta_1}(x)$, we use the sum of these two components (as in a DeepSets architecture) to ensure the symmetry condition of SFGI product functions. One could replace this final sum with a final max, as in PointNet. However, to show that PointNets are able to universally approximate continuous functions $F : K \to \mathbb{R}$ where $K \subset \mathbb{R}^{a \times 2}$ is compact, we need to use a $\delta$-net of $K$, whose size also serves as the intermediate dimension for $\phi$. As $K \subseteq [0, N]^{a \times 2}$ in our case (where $N$ is the maximum cardinality of a point set), the intermediate dimension for a max-pooling invariant architecture at the end (i.e., PointNet) now depends on the maximum size of the input point sets.
4 Experimental results
We evaluate the accuracy of the 1-Wasserstein distance approximations of our proposed neural network architecture, $\mathcal{N}_{\mathrm{ProductNet}}$, against two different baseline architectures: (1) a Siamese autoencoder known as the Wasserstein Point Cloud Embedding network (WPCE)
Table 1: Mean relative error between approximations and ground truth Wasserstein distance between point sets. The top row for each dataset shows the approximation quality for point sets with input sizes that were seen at training time, while the bottom row shows the approximation quality for point sets with input sizes that were not seen at training time. Note that $\mathcal{N}_{\mathrm{ProductNet}}$ is our model.
Dataset | Input size | N_ProductNet | WPCE | N_SDeepSets | Sinkhorn
noisy-sphere-3 | [100, 300] | 0.046 ± 0.043 | 0.341 ± 0.202 | 0.362 ± 0.241 | 0.187 ± 0.232
noisy-sphere-3 | [300, 500] | 0.158 ± 0.198 | 0.356 ± 0.286 | 0.608 ± 0.431 | 0.241 ± 0.325
noisy-sphere-6 | [100, 300] | 0.015 ± 0.014 | 0.269 ± 0.285 | 0.291 ± 0.316 | 0.137 ± 0.122
noisy-sphere-6 | [300, 500] | 0.049 ± 0.054 | 0.423 ± 0.408 | 0.508 ± 0.473 | 0.198 ± 0.181
uniform | 256 | 0.097 ± 0.073 | 0.120 ± 0.103 | 0.123 ± 0.092 | 0.073 ± 0.009
uniform | [200, 300] | 0.131 ± 0.096 | 1.712 ± 0.869 | 0.917 ± 0.869 | 0.064 ± 0.008
ModelNet-small | [20, 200] | 0.084 ± 0.077 | 0.077 ± 0.075 | 0.105 ± 0.096 | 0.101 ± 0.032
ModelNet-small | [300, 500] | 0.111 ± 0.086 | 0.241 ± 0.198 | 0.261 ± 0.245 | 0.193 ± 0.155
ModelNet-large | 2048 | 0.140 ± 0.206 | 0.159 ± 0.141 | 0.166 ± 0.129 | 0.148 ± 0.048
ModelNet-large | [1800, 2000] | 0.162 ± 0.228 | 0.392 ± 0.378 | 0.470 ± 0.628 | 0.188 ± 0.088
RNAseq | [20, 200] | 0.012 ± 0.010 | 0.477 ± 0.281 | 0.482 ± 0.291 | 0.040 ± 0.009
RNAseq | [300, 500] | 0.292 ± 0.041 | 0.583 ± 0.309 | 0.575 ± 0.302 | 0.048 ± 0.006
(Kawano et al., 2020) (previously introduced at the end of Section 1; this is a SOTA method for neural approximation of Wasserstein distance), and (2) a Siamese DeepSets network, denoted $\mathcal{N}_{\mathrm{SDeepSets}}$, which is a single DeepSets model that maps both point sets to a Euclidean space and approximates the 1-Wasserstein distance as the $\ell_2$ norm between the outputs for the two point sets. As Siamese networks are widely employed for metric learning, the $\mathcal{N}_{\mathrm{SDeepSets}}$ model is a natural baseline for comparison against $\mathcal{N}_{\mathrm{ProductNet}}$. We additionally test our neural network approximations against the Sinkhorn distance where the regularization parameter was set to $\epsilon = 0.1$. For each model, we record results for the best model according to a hyperparameter search over the parameters of each model. Finally, we use the ModelNet40 (Wu et al., 2015) dataset, which consists of point clouds in $\mathbb{R}^3$, and a gene expression dataset (RNAseq) which consists of 4360 cells each represented by 2000 genes (i.e., 4360 points in $\mathbb{R}^{2000}$), as well as three synthetic datasets: (1) uniform, where point sets are in $\mathbb{R}^2$; (2) noisy-sphere-3, where point sets are in $\mathbb{R}^3$; and (3) noisy-sphere-6, where point sets are in $\mathbb{R}^6$. The RNAseq dataset is publicly available courtesy of the Allen Institute (Yao et al., 2021). Additional details and experiments approximating the 2-Wasserstein distance are available in Appendix D.
Approximating Wasserstein distances. Our results comparing 1-Wasserstein distance approximations are summarized in Table 1. Additionally, see Table 5 for a summary of the time needed for training. For most datasets, $\mathcal{N}_{\mathrm{ProductNet}}$ produces more accurate approximations of Wasserstein distances, both for input point sets of sizes seen at training time and for input point sets of sizes unseen at training time.
Table 2: Training time and number of epochs needed for convergence for the best model.

Dataset | | N_ProductNet | WPCE | N_SDeepSets
noisy-sphere-3 | Time | 6min | 1hr 46min | 9min
noisy-sphere-3 | Epochs | 20 | 100 | 100
noisy-sphere-6 | Time | 12min | 4hr 6min | 1hr 38min
noisy-sphere-6 | Epochs | 20 | 100 | 100
uniform | Time | 7min | 3hr 36min | 1hr 27min
uniform | Epochs | 23 | 100 | 100
ModelNet-small | Time | 7min | 1hr 23min | 12min
ModelNet-small | Epochs | 20 | 100 | 100
ModelNet-large | Time | 8min | 3hr 5min | 40min
ModelNet-large | Epochs | 20 | 100 | 100
RNAseq (2k) | Time | 15min | 14hr 26min | 3hr 01min
RNAseq (2k) | Epochs | 73 | 100 | 100
For the high-dimensional RNAseq dataset, our approximation remains accurate in comparison with other methods, including the standard Sinkhorn approximation, for input point sets of sizes seen at training time. The only exception is ModelNet-small, where the $\mathcal{N}_{\mathrm{ProductNet}}$ approximation error is slightly larger than that of WPCE for input point set sizes used during training (top row for each dataset in Table 1). However, for point sets whose input sizes were not used during training (bottom row for each dataset in Table 1), $\mathcal{N}_{\mathrm{ProductNet}}$ showed significantly lower error than all other methods including WPCE. These results, along with a more detailed plot in Figure 3 in Appendix D, indicate that $\mathcal{N}_{\mathrm{ProductNet}}$ generalizes better than WPCE to point sets of input sizes unseen at training time. Also, see Appendix D for additional discussion about generalization. Furthermore, one major advantage of $\mathcal{N}_{\mathrm{ProductNet}}$ over WPCE is the dramatically reduced time needed for training (cf. Table 2). This substantial difference in training time is due to WPCE's use of the Sinkhorn reconstruction loss, as the $O(n^2)$ computation time for the Sinkhorn distance can be prohibitively expensive as input point set sizes grow. Thus, our results indicate that, compared to WPCE, $\mathcal{N}_{\mathrm{ProductNet}}$ can reduce training time while still achieving comparable or better quality approximations of Wasserstein distance. Using $\mathcal{N}_{\mathrm{ProductNet}}$, we can produce high quality approximations of the 1-Wasserstein distance while avoiding the extra cost associated with using an autoencoder architecture and Sinkhorn regularization. Finally, all models produce much faster approximations than the Sinkhorn distance (see Tables 5 and 8 in Appendix D). In summary, as compared to WPCE, our model is more accurate in approximating the 1-Wasserstein distance, generalizes better to larger input point set sizes, and is more efficient in terms of training time.
5 Concluding Remarks
Our work presents a general neural network framework for approximating SFGI functions, which can be combined with geometric sketching ideas into a specific and efficient neural network for approximating p-Wasserstein distances. We intend to utilize $\mathcal{N}_{\mathrm{ProductNet}}$ as an accurate, efficient, and differentiable approximation for Wasserstein distance in downstream machine learning tasks where Wasserstein distance is employed, such as loss functions for aligning single cell multi-omics data (Demetci et al., 2020) or compressing energy profiles in high energy particle colliders (Di Guglielmo et al., 2021; Shenoy and The CMS Collaboration Team, 2022). Beyond Wasserstein distance, we will look to apply our framework to a wide array of geometric problems that can be considered SFGI functions and are desirable to approximate via neural networks. For instance, consider the problems of computing the optimal Wasserstein distance under rigid transformations or the Gromov-Wasserstein distance, both of which can be represented as SFGI functions where the factor-wise group invariances include both permutations and rigid transformations. In that case, our sketch must be invariant to both permutations and orthogonal group actions on the left. It remains to be seen whether there is a neural network architecture which can approximate such an SFGI function to within an arbitrary additive $\epsilon$-error with complexity that does not depend on the maximum size of the input sets.
6 Acknowledgements
The authors thank anonymous reviewers for their helpful comments. Furthermore, the au-
thors thank Rohan Gala and the Allen Institute for generously providing the RNAseq dataset.
Finally, Samantha Chen would like to thank Puoya Tabaghi for many helpful discussions
about permutation invariant neural networks and Tristan Brugère for his implementation of
the Sinkhorn algorithm. This research is supported by National Science Foundation (NSF)
grants CCF-2112665 and CCF-2217058 and National Institutes of Health (NIH) grant RF1
MH125317.
References
Tal Amir, Steven J Gortler, Ilai Avni, Ravina Ravina, and Nadav Dym. Neural injective
functions for multisets, measures and graphs via a finite witness theorem. arXiv preprint
arXiv:2306.06529, 2023.
Arturs Backurs, Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Scalable nearest
neighbor search for optimal transport. In International Conference on machine learning,
pages 497–506. PMLR, 2020.
Pietro Bongini, Monica Bianchini, and Franco Scarselli. Molecular generative graph neural
networks for drug discovery. Neurocomputing, 450:242–252, 2021.
Emmanuel Briand. When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials? Beiträge zur Algebra und Geometrie: Contributions to Algebra and Geometry, 45(2):353–368, 2004.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "Siamese" time delay neural network. Advances in neural information processing systems, 6, 1993.
Christian Bueno and Alan Hylton. On the representation power of set pooling networks.
Advances in Neural Information Processing Systems, 34:17170–17182, 2021.
Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. Learning wasserstein embeddings.
arXiv preprint arXiv:1710.07457, 2017.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances
in neural information processing systems, 26, 2013.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of
control, signals and systems, 2(4):303–314, 1989.
Pinar Demetci, Rebecca Santorella, Björn Sandstede, William Stafford Noble, and Ritamb-
hara Singh. Gromov-wasserstein optimal transport to align single-cell multi-omics data.
BioRxiv, pages 2020–04, 2020.
Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and
Leonidas J Guibas. Vector neurons: A general framework for so (3)-equivariant net-
works. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 12200–12209, 2021.
Giuseppe Di Guglielmo, Farah Fahim, Christian Herwig, Manuel Blanco Valentin, Javier
Duarte, Cristian Gingu, Philip Harris, James Hirschauer, Martin Kwok, Vladimir Loncar,
et al. A reconfigurable neural network asic for detector front-end data compression at the
hl-lhc. IEEE Transactions on Nuclear Science, 68(8):2179–2186, 2021.
Weitao Du, He Zhang, Yuanqi Du, Qi Meng, Wei Chen, Nanning Zheng, Bin Shao, and
Tie-Yan Liu. Se (3) equivariant graph neural networks with complete local frames. In
International Conference on Machine Learning, pages 5583–5608. PMLR, 2022.
Nadav Dym and Steven J Gortler. Low dimensional invariant embeddings for universal
geometric learning. arXiv preprint arXiv:2205.02956, 2022.
Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn
divergences. In The 22nd International Conference on Artificial Intelligence and Statistics,
pages 2681–2690, 2019.
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo
Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko,
Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard,
Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine
Learning Research, 22(78):1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl.
Neural message passing for quantum chemistry. In International conference on machine
learning, pages 1263–1272. PMLR, 2017.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an
invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. In 3rd international
workshop on statistical and computational theories of vision, volume 2, page 5, 2003.
Keisuke Kawano, Satoshi Koide, and Takuro Kutsuna. Learning wasserstein isometric em-
bedding for point clouds. In 2020 International Conference on 3D Vision (3DV), pages
473–482. IEEE, 2020.
Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural net-
works. Advances in Neural Information Processing Systems, 32, 2019.
Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for
one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and
Stefanie Jegelka. Sign and basis invariant networks for spectral graph representation
learning. arXiv preprint arXiv:2202.13013, 2022.
Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport
mapping via input convex neural networks. In International Conference on Machine
Learning, pages 6672–6681. PMLR, 2020.
Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant
graph networks. arXiv preprint arXiv:1812.09902, 2018.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of
invariant networks. In International conference on machine learning, pages 4363–4371.
PMLR, 2019.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. On learning sets of symmetric
elements. In International conference on machine learning, pages 6734–6744. PMLR, 2020.
Assaf Naor and Gideon Schechtman. Planar earthmover is not in $L_1$. SIAM Journal on
Computing, 37(3):804–826, 2007.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im-
perative style, high-performance deep learning library. Advances in neural information
processing systems, 32, 2019.
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on
point sets for 3d classification and segmentation. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 652–660, 2017.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Mon-
fardini. The graph neural network model. IEEE transactions on neural networks, 20(1):
61–80, 2008.
Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of
Mathematical Imaging and Vision, 56(2):238–259, 2016.
Nimrod Segol and Yaron Lipman. On universal equivariant set networks. arXiv preprint
arXiv:1910.02421, 2019.
Rohan Shenoy and The CMS Collaboration Team. EMD Neural Network for HGCAL Data
Compression Encoder ASIC. In APS April Meeting Abstracts, volume 2022 of APS Meeting
Abstracts, page Q09.008, January 2022.
Amirhossein Taghvaei and Amin Jalali. 2-wasserstein approximation via restricted convex po-
tentials with application to improved training for gans. arXiv preprint arXiv:1902.07197,
2019.
Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung.
Revisiting point cloud classification: A new benchmark dataset and classification model
on real-world data. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 1588–1597, 2019.
Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne.
On the limitations of representing functions on sets. In International Conference on Ma-
chine Learning, pages 6487–6494. PMLR, 2019.
Edward Wagstaff, Fabian B Fuchs, Martin Engelcke, Michael A Osborne, and Ingmar Posner.
Universal approximation of functions on sets. Journal of Machine Learning Research, 23
(151):1–56, 2022.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and
Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
Zizhen Yao, Cindy TJ van Velthoven, Thuc Nghi Nguyen, Jeff Goldy, Adriana E Sedeno-
Cortes, Fahimeh Baftizadeh, Darren Bertagnolli, Tamara Casper, Megan Chiang, Kirsten
Crichton, et al. A taxonomy of transcriptomic cell types across the isocortex and hip-
pocampal formation. Cell, 184(12):3222–3241, 2021.
Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. Con-
structive Approximation, 55(1):407–474, 2022.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhut-
dinov, and Alexander J Smola. Deep sets. Advances in neural information processing
systems, 30, 2017.
A Approximating Hausdorff distance
Suppose the underlying metric for $\mathcal{X}$ (the space of weighted point sets) is the Hausdorff distance $d_H$. To clarify, given $S_1 = \{(x_i, w_i) : x_i \in \mathbb{R}^d, \sum_{i=1}^{n} w_i = 1\}$ and $S_2 = \{(y_j, w'_j) : y_j \in \mathbb{R}^d, \sum_{j=1}^{m} w'_j = 1\}$,
$$d_H(S_1, S_2) = d_H(\{x_1, \ldots, x_n\}, \{y_1, \ldots, y_m\}).$$
To get an encoding/decoding for $\mathcal{X}$ equipped with the Hausdorff distance using max-pooling, we will use the construction given in the proof of universality in (Qi et al., 2017), which we briefly recap here. As in Theorem 3.4, we consider a $\delta$-net $\{y_1, \ldots, y_a\}$ of our original space $\Omega$. For each $y_i$, we again let $h_i(x) = e^{-d_\Omega(x, B_\delta(y_i))}$ and define $\overline{h} : \Omega \to \mathbb{R}^a$ as $\overline{h}(x) = [h_1(x), \ldots, h_a(x)]$. However, instead of using sum-pooling to define $h : \mathcal{X} \to \mathbb{R}^a$, we use max-pooling:
$$h(S) = \max_{x \in S} \overline{h}(x)$$
where $\max$ is the element-wise maximum. Then the decoding function $g : \mathbb{R}^a \to \mathcal{X}$ is defined as
$$g(v) = \Big\{ \Big(y_i, \frac{v_i}{\|v\|_1}\Big) : v_i \ge 1 \Big\}.$$
Note that $v_i \ge 1$ when there is an $x_i \in S$ such that $x_i \in B_\delta(y_i)$. So for each $x_i \in S$ there is a $y_j \in g \circ h(S)$ such that $d_\Omega(x_i, y_j) < \delta$. Thus, $d_H(S, g \circ h(S)) < \delta$.
The final network for approximating uniformly continuous functions with respect to the Hausdorff distance uses max-pooling:
$$\mathcal{N}^{\max}_{\mathrm{ProductNet}}(A, B) = \rho_{\theta_3}\Big( \phi_{\theta_2}\big( \max_{x \in A} h_{\theta_1}(x) \big) + \phi_{\theta_2}\big( \max_{y \in B} h_{\theta_1}(y) \big) \Big).$$
Note that $\mathcal{N}^{\max}_{\mathrm{ProductNet}}$ is able to approximate the Hausdorff distance to $\epsilon$-accuracy.
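A minimal PyTorch sketch of this max-pooling variant is given below; weights are ignored since the Hausdorff distance depends only on the supports, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MaxProductNet(nn.Module):
    """rho(phi(max_{x in A} h(x)) + phi(max_{y in B} h(y))) with element-wise max-pooling."""
    def __init__(self, point_dim, sketch_dim=64, latent_dim=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(point_dim, 128), nn.ReLU(), nn.Linear(128, sketch_dim))
        self.phi = nn.Sequential(nn.Linear(sketch_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.rho = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def embed(self, points):
        # Element-wise max over the set replaces the weighted sum used for W_p.
        return self.phi(self.h(points).max(dim=0).values)

    def forward(self, A, B):
        return self.rho(self.embed(A) + self.embed(B))

# Estimate a Hausdorff-style distance between two (unweighted) point sets in R^3.
model = MaxProductNet(point_dim=3)
print(model(torch.rand(50, 3), torch.rand(70, 3)).item())
```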
B Comparison with other neural network architectures for approximating Wasserstein distance
Through the lens of 1-Wasserstein approximation, we compare the power of $\mathcal{N}_{\mathrm{ProductNet}}$ with other natural architectures, namely standard MLPs and Siamese networks. The standard MLP and Siamese networks are natural points of comparison for $\mathcal{N}_{\mathrm{ProductNet}}$ as (1) MLPs are the go-to choice in deep learning due to their versatility and efficacy across diverse problem domains, and (2) Siamese networks have gained significant prominence in the context of metric learning, frequently employed for tasks aimed at learning effective similarity metrics between pairs of inputs (Hadsell et al., 2006; Bromley et al., 1993; Koch et al., 2015). However, we discuss the limitations of both the standard MLP and Siamese networks for the task of learning Wasserstein distances. For this setting, we let $(X, d_X)$ be some compact Euclidean space, i.e., $X \subset \mathbb{R}^d$ is compact and $d_X$ is some $\ell_p$ norm. Then let $\mathcal{X}$ be the set of weighted point sets of cardinality at most $N$, i.e., sets of the form $S = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in X\}$.
Comparison with multilayer perceptrons (MLPs). First, we consider a standard MLP model which takes a single tensor as input. Note that for each $S \in \mathcal{X}$, there is an $\tilde{S} \in \mathcal{X}$ such that
$$\tilde{S} = \{(x_i, w_i) : (x_i, w_i) \in S\} \cup \{(x_1, 0)^{\times (N - |S|)}\}$$
where $(x_1, 0)^{\times (N - |S|)}$ means that we repeat the element $(x_1, 0)$ exactly $N - |S|$ times. Notice that $W_p(S, \tilde{S}) = 0$ and for any $S' \in \mathcal{X}$, $W_p(S, S') = W_p(\tilde{S}, S')$. If $|S| = N$, then $\tilde{S} = S$. To use an MLP model to approximate the Wasserstein distance, we want to represent any $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$ as an element of $\mathbb{R}^{(d+1) \times 2N}$, where $S_1 = \{(x_i, w_i) : w_i \in \mathbb{R}_+, \sum_i w_i = 1, x_i \in X\}$ and $S_2 = \{(x'_j, w'_j) : w'_j \in \mathbb{R}_+, \sum_j w'_j = 1, x'_j \in X\}$. Informally, given an "empty" element of $\mathbb{R}^{(d+1) \times 2N}$, we map $S_1$ to the first $N$ columns and map $S_2$ to the last $N$ columns. Formally, given $S_1$ and $S_2$ as defined above, we define the mapping $\beta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{(d+1) \times 2N}$ as
$$\beta(S_1, S_2) = \beta(\tilde{S}_1, \tilde{S}_2) = \begin{pmatrix} x_1 & \cdots & x_N & x'_1 & \cdots & x'_N \\ w_1 & \cdots & w_N & w'_1 & \cdots & w'_N \end{pmatrix}$$
where we abuse notation slightly and use $(x_i, w_i)$ and $(x'_j, w'_j)$ to describe the elements of $\tilde{S}_1$ and $\tilde{S}_2$ respectively. Given $M \in \mathbb{R}^{(d+1) \times 2N}$, define $\beta^{-1} : \mathrm{Image}(\beta) \to \mathcal{X} \times \mathcal{X}$ as
$$\beta^{-1}\begin{pmatrix} x_1 & \cdots & x_N & x'_1 & \cdots & x'_N \\ w_1 & \cdots & w_N & w'_1 & \cdots & w'_N \end{pmatrix} = \big( \{(x_i, w_i)\}, \{(x'_j, w'_j)\} \big).$$
Now, in order to use the classical universal approximation theorem to approximate $W_p$ to within an additive $\epsilon$-error, we must show that there is a continuous function $f : \mathrm{Image}(\beta) \to \mathbb{R}$ such that $f(\beta(S_1, S_2))$ approximates $W_p(S_1, S_2)$.
Lemma B.1. Given $\beta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{(d+1) \times 2N}$ as defined previously, for any $\epsilon > 0$ there is a continuous function $f : \mathrm{Image}(\beta) \to \mathbb{R}$ such that for any $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$,
$$|W_p(S_1, S_2) - f(\beta(S_1, S_2))| < \epsilon.$$
Proof. Let $\epsilon > 0$. From Corollary 3.5, there is a $\delta > 0$ such that there are continuous functions $\overline{h} : X \to \mathbb{R}^{a(\delta)}$, $\phi : \mathbb{R}^{a(\delta)} \to \mathbb{R}^{a^*}$, and $\rho : \mathbb{R}^{a^*} \to \mathbb{R}$ such that for any $A, B \in \mathcal{X}$,
$$\Big| W_p(A, B) - \rho\Big( \phi\Big( \sum_{(x, w_x) \in A} w_x \overline{h}(x) \Big) + \phi\Big( \sum_{(x, w_x) \in B} w_x \overline{h}(x) \Big) \Big) \Big| < \epsilon.$$
Let $(S_1, S_2) \in \mathcal{X} \times \mathcal{X}$ and let $M = \beta(S_1, S_2)$. Note that $\beta^{-1}(M) = (S_1, S_2)$. For $M \in \mathrm{Image}(\beta)$, let $M[:, i]$ denote the $i$-th column vector of $M$, let $M[:d, i]$ denote the vector of the first $d$ entries of the $i$-th column of $M$, and let $M[i, j]$ denote the $(i, j)$-th entry of $M$. We know that $M[:, i] \in \mathbb{R}^{d+1}$. Since $M \in \mathrm{Image}(\beta)$, we know that $M[:d, i] \in X$ and $M[d, i]$ corresponds to a weight. Furthermore, since $\overline{h}$ is a continuous mapping from $X$ to $\mathbb{R}^{a(\delta)}$, the map $M \mapsto M[d, i]\, \overline{h}(M[:d, i])$ is continuous. Thus, since $\phi$ is also continuous, the function $f' : \mathrm{Image}(\beta) \to \mathbb{R}^{a^*}$ given by
$$f'(M) = \phi\Big( \sum_{i=1}^{N} M[d, i]\, \overline{h}(M[:d, i]) \Big) + \phi\Big( \sum_{i=N+1}^{2N} M[d, i]\, \overline{h}(M[:d, i]) \Big)$$
is continuous. Since $\rho$ is also continuous, we get that $\rho \circ f' : \mathrm{Image}(\beta) \to \mathbb{R}$ is continuous. Thus, $f : \mathrm{Image}(\beta) \to \mathbb{R}$ defined as $f = \rho \circ f'$ is also continuous and $|f(\beta(S_1, S_2)) - W_p(S_1, S_2)| < \epsilon$ by the property from Corollary 3.5.
Since $f$ in the above lemma operates on a fixed-dimensional space, we can now use an MLP to approximate it. Then, by the above lemma and the universal approximation property of MLPs, there is an MLP which can approximate $W_p : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to an arbitrary additive $\epsilon$-error. However, this approach has the following issues. While an MLP can approximate $W_p$ to an arbitrary additive $\epsilon$-error, the model complexity of the MLP depends not only on the approximation error, but also on the maximum size $N$ of the input point sets. In contrast, our neural network $\mathcal{N}_{\mathrm{ProductNet}}$ introduced in Eqn. (2) has model complexity which is independent of $N$; see Corollary 3.6. Furthermore, the function represented by the MLP (which we use to approximate $f$) may not be symmetric in the two point sets, and may not guarantee permutation invariance with respect to $G \times G$. In practice, computationally expensive techniques such as data augmentation are often required in order for an unconstrained MLP to approximate a structured function such as our Wasserstein distance function, which is an SFGI function.
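For concreteness, the following NumPy sketch implements the padding map $\beta$ described above; repeating the first point with zero weight mirrors the construction of $\tilde{S}$, and the function names are ours.

```python
import numpy as np

def pad_set(points, weights, N):
    """Build S~: pad to exactly N elements by repeating the first point with weight 0."""
    k = len(points)
    pad_pts = np.repeat(points[:1], N - k, axis=0)
    pad_w = np.zeros(N - k)
    return np.vstack([points, pad_pts]), np.concatenate([weights, pad_w])

def beta(S1, S2, N):
    """beta(S1, S2): a (d+1) x 2N array whose columns are (point; weight) pairs of S1~ then S2~."""
    blocks = []
    for pts, w in (pad_set(*S1, N), pad_set(*S2, N)):
        blocks.append(np.vstack([pts.T, w[None, :]]))   # (d+1, N) block per set
    return np.hstack(blocks)                            # (d+1, 2N)

# Two point sets in R^3 with at most N = 10 points each.
S1 = (np.random.rand(4, 3), np.full(4, 0.25))
S2 = (np.random.rand(7, 3), np.full(7, 1 / 7))
print(beta(S1, S2, N=10).shape)  # (4, 20) since d = 3
```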
Comparison with Siamese architectures. As mentioned previously, one common way of learning product functions is to use a Siamese network which embeds $\mathcal{X}$ into Euclidean space and then uses some $\ell_q$ norm to approximate the desired product function. Consider the learning of Wasserstein distances again: to approximate $W_p(A, B)$ for two point sets $A$ and $B$, the Siamese network first embeds each point set into a fixed-dimensional Euclidean space $\mathbb{R}^D$ via a function $\phi_\theta$ modeled by a neural network, and then computes $\|\phi_\theta(A) - \phi_\theta(B)\|_q$ for some $1 \le q < \infty$. Intuitively, compared to our neural network $\mathcal{N}_{\mathrm{ProductNet}}$ introduced in Eqn. (2) (recall Figure 1), the Siamese network replaces the outer function $\rho$ by simply the $L_q$-norm of $\phi_\theta(A) - \phi_\theta(B)$. However, this approach requires that one can find a near-isometric embedding of the Wasserstein distance into the $L_q$ distance on a Euclidean space, and there exist lower bound results on the distortion incurred when embedding the Wasserstein distance into $L_q$ spaces. In the following section, we review one such result on the lower bound of the metric distortion when embedding the Wasserstein distance into $L_1$ space, implying that if we choose $q = 1$, then in the worst case the Siamese network will incur at least as much distortion as that lower bound.
C Non-embeddability theorems for Wasserstein distances
Here we summarize results pertaining to the limitations of embedding the Wasserstein distance into the L_q distance on a Euclidean space. Consider probability distributions over the grid of points {0, 1, . . . , D}^2 ⊂ R^2, equipped with the 1-Wasserstein distance, i.e., the metric space (P({0, 1, . . . , D}^2), W_1). Let L_1 denote the space of Lebesgue measurable functions f : [0, 1] → R such that
\[
\|f\|_1 = \int_0^1 |f(t)| \, dt < \infty.
\]
Given a mapping F : (P({0, 1, . . . , D}^2), W_p) → L_1 such that for any µ, ρ ∈ P({0, 1, . . . , D}^2),
\[
W_p(\mu, \rho) \le \| F(\mu) - F(\rho) \|_1 \le L \cdot W_p(\mu, \rho), \tag{3}
\]
the distortion of F is the value L.
Theorem C.1 ((Naor and Schechtman, 2007)). Any embedding (P({0, 1, . . . , D}^2), W_1) → L_1 must incur distortion at least Ω(√(log D)).
For (P({0, 1, . . . , D}^d), W_1) with d ≥ 2, P({0, 1, . . . , D}^d) contains P({0, 1, . . . , D}^2), so the Ω(√(log D)) lower bound still applies to L_1 embeddings of P({0, 1, . . . , D}^d) for d > 2.
From the above results, we can see that for any Siamese architecture (even in the simple case of finite point sets in R^2 and R^3), we are unable to approximate Wasserstein distances to arbitrary accuracy via any L_1 embedding.
Corollary C.2. Given a neural network N_Siamese : X → R^d, where X is the set of weighted point sets over {0, 1, . . . , D}^2, the approximation ∥N_Siamese(µ) − N_Siamese(ν)∥_1 incurs distortion at least Ω(√(log D)).
Note that if we consider our input for a Siamese architecture to be finite point sets over {0, 1, . . . , D}^2, we allow multisets, so the input set size is not bounded by D.
D Experimental details
D.1 Baseline models for comparison with N_ProductNet
Here, we will detail two baseline neural networks in our experiments.
Wasserstein point cloud embedding (WPCE) network. First defined by Kawano et al. (2020), the WPCE network is a Siamese autoencoder architecture. It consists of an encoder network N_encoder and a decoder network N_decoder. WPCE takes as input two point sets P, Q ⊂ R^d. N_encoder is a permutation invariant neural network, which we chose to be DeepSets. In other words,
\[
\mathcal{N}_{\mathrm{encoder}}(P) = \phi_{\theta_2}\Big( \sum_{x \in P} h_{\theta_1}(x) \Big).
\]
Note that one may also choose PointNet as N_encoder. However, in our experiments, we did not see a large difference in approximation quality between PointNet and DeepSets (where the difference between the two amounts to using a sum vs. max aggregator). For the sake of consistency with N_ProductNet, we chose to use a sum aggregator for N_encoder (DeepSets). The decoder network N_decoder is a fully connected neural network which outputs a fixed-size point set in R^d. WPCE then uses a Sinkhorn reconstruction loss term to regularize the embedding produced by the encoder. Thus, given a set of paired input point sets and their Wasserstein distances {(P_i, Q_i, W_1(P_i, Q_i)) : i ∈ [N]}, the loss we optimize for WPCE is
\[
L(P, Q) = \frac{1}{N} \sum_{i} \Big( \| \mathcal{N}_{\mathrm{encoder}}(P_i) - \mathcal{N}_{\mathrm{encoder}}(Q_i) \|_2 - W_1(P_i, Q_i) \Big)^2
+ \lambda \Big( \frac{1}{N} \sum_{i} S_\epsilon\big( \mathcal{N}_{\mathrm{decoder}}(\mathcal{N}_{\mathrm{encoder}}(P_i)),\, P_i \big)
+ \frac{1}{N} \sum_{i} S_\epsilon\big( \mathcal{N}_{\mathrm{decoder}}(\mathcal{N}_{\mathrm{encoder}}(Q_i)),\, Q_i \big) \Big) \tag{4}
\]
Table 3: Size of the output point set from the WPCE decoder for each dataset.

Dataset   noisy-sphere-2   noisy-sphere-6   uniform   ModelNet-small   ModelNet-large   RNAseq
Size      200              200              256       100              2048             100
Table 4: Summary of details for each dataset. Note that min and max refer to the minimum and maximum number of points per input point set.

Dataset             dim   min    max    training pairs   val pairs
noisy-spheres-2       3   100    300    2000             200
noisy-spheres-6       6   100    300    3600             400
uniform               2   256    256    3000             300
ModelNet-small        3    20    200    3000             300
ModelNet-large        3  2048   2048    4000             400
RNAseq             2000    20    200    3000             300
where λ ∈ R is a constant that controls the balance between the two loss terms and S_ϵ denotes the Sinkhorn divergence, with ϵ referring to its regularization parameter. For our experiments, we chose λ = 0.1 and ϵ = 0.1. Furthermore, for each dataset used in our experiments, we used a different size for the fixed-size output point set of the decoder; this parameter is summarized per dataset in Table 3.
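As a concrete reference, the following is a hedged sketch of this objective for a single training pair, not the authors' code: it assumes `encoder` is a DeepSets-style set encoder, `decoder` an MLP producing a fixed-size point set, and it uses GeomLoss's SamplesLoss as the Sinkhorn divergence S_ϵ (its `blur` parameter plays the role of the entropic regularization scale).

```python
# Hedged sketch of the WPCE objective in Eq. (4) for one pair (P, Q).
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=1, blur=0.1)  # stands in for S_eps, eps ~ 0.1
lam = 0.1                                               # lambda in Eq. (4)

def wpce_loss(encoder, decoder, P, Q, w1_true):
    # P: (n, d), Q: (m, d) point sets; w1_true: precomputed W_1(P, Q)
    zP, zQ = encoder(P), encoder(Q)
    dist_term = (torch.norm(zP - zQ, p=2) - w1_true) ** 2      # embedding-vs-W_1 term
    recon_term = sinkhorn(decoder(zP), P) + sinkhorn(decoder(zQ), Q)  # reconstruction term
    return dist_term + lam * recon_term
```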
Siamese DeepSets. The baseline Siamese DeepSets model, N_SDeepSets(·, ·), consists of a single DeepSets model which maps each input point set to some Euclidean space R^d. The ℓ_2-norm between the two final embeddings is then used as the final estimate of the Wasserstein distance. Formally, let
\[
\mathcal{N}_{\mathrm{DeepSets}}(P) = \phi_{\theta_2}\Big( \sum_{x \in P} h_{\theta_1}(x) \Big),
\]
where ϕ_θ2 and h_θ1 are both MLPs, be the DeepSets model. Then, given two point sets P and Q, the final approximation of the Wasserstein distance given by N_SDeepSets(P, Q) is
\[
\mathcal{N}_{\mathrm{SDeepSets}}(P, Q) = \| \mathcal{N}_{\mathrm{DeepSets}}(P) - \mathcal{N}_{\mathrm{DeepSets}}(Q) \|_2 .
\]
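A minimal PyTorch-style sketch of this baseline follows; this is our own illustrative code, and the layer widths are hypothetical rather than the tuned hyperparameters.

```python
import torch
import torch.nn as nn

class SiameseDeepSetsSketch(nn.Module):
    # N_SDeepSets(P, Q) = || DeepSets(P) - DeepSets(Q) ||_2
    def __init__(self, d, hidden=128, embed=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d, hidden), nn.LeakyReLU(), nn.Linear(hidden, embed))
        self.phi = nn.Sequential(nn.Linear(embed, hidden), nn.LeakyReLU(), nn.Linear(hidden, embed))

    def deepsets(self, x):                      # x: (n, d) point set
        return self.phi(self.h(x).sum(dim=0))   # sum-pool per-point features, then phi_theta2

    def forward(self, P, Q):
        return torch.norm(self.deepsets(P) - self.deepsets(Q), p=2)
```

Unlike N_ProductNet, the outer combination here is fixed to an ℓ_2 distance between embeddings, which is exactly where the embedding lower bounds of Appendix C become relevant.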
D.2 Training and implementation details
Datasets. We used several synthetic datasets as well as the ModelNet40 point cloud dataset. We used two different types of synthetic datasets. We construct the 'noisy-sphere-d' dataset by sampling pairs of point clouds from four d-dimensional spheres centered at the origin with increasing radii of 0.25, 0.5, 0.75, and 1.0. For our experiments, we used 'noisy-sphere-3' and 'noisy-sphere-6'. Finally, the 'uniform' dataset of point sets in R^2 is constructed by sampling point sets from the uniform distribution on [−4, 4] × [−4, 4]. The full details and names of each dataset are summarized in Table 4.
Implementation. We implement all neural network architectures using the PyTorch (Paszke et al., 2019) and GeomLoss (Feydy et al., 2019) libraries. The ground truth 1-Wasserstein distances and Sinkhorn approximations were computed using Python Optimal Transport (POT) (Flamary et al., 2021). Note that for large point sets and for higher-dimensional datasets, there is often a high degree of numerical instability in the POT implementation of the Sinkhorn algorithm. In these cases (ModelNet-large and RNAseq), we used our own implementation of the Sinkhorn algorithm. For each model, we used Leaky-ReLU as the non-linearity. To train each model, we set the batch size for each dataset to 128 and the learning rate to 0.001. All models were trained on an Nvidia RTX A6000 GPU.
For both N_ProductNet and N_SDeepSets, given two input point sets, we minimize the mean squared error between the approximation produced by the network and the true Wasserstein distance. In other words, given two point sets P, Q, the loss for N_ProductNet is defined as L_{N_ProductNet}(P, Q) = (N_ProductNet(P, Q) − W_1(P, Q))^2, and the loss for Siamese DeepSets is L_{N_SDeepSets}(P, Q) = (N_SDeepSets(P, Q) − W_1(P, Q))^2. For WPCE, we train the network using the loss function defined in Eq. (4). Note that for N_ProductNet, the hyperparameters are the width, depth, and output dimension of the MLPs which represent h_θ1 and ϕ_θ2, and the width and depth of the MLP which represents ρ_θ3. For WPCE, we set the decoder to a three-layer neural network of width 100 and adjusted the width, depth, and output dimension of the MLPs which represent ϕ_θ2 and h_θ1 in N_encoder. To find the best model for each architecture, we randomly sampled hyperparameter configurations and conducted a hyperparameter search over 85 models for N_ProductNet and 75 models for both WPCE and N_SDeepSets.
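The regression objective above can be summarized in a short training-step sketch. This is our own illustrative code (the `ProductNetSketch` module and its input signature are the hypothetical ones from the earlier sketch), with ground-truth W_1 computed via POT's exact solver (`ot.dist` / `ot.emd2`) as described above.

```python
# Hedged sketch of one training step for the MSE objective on a single pair.
import numpy as np
import ot
import torch

def ground_truth_w1(X, Y):
    # X: (n, d), Y: (m, d) numpy arrays with uniform weights
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    M = ot.dist(X, Y, metric="euclidean")  # pairwise ground costs
    return ot.emd2(a, b, M)                # exact 1-Wasserstein cost

def train_step(model, optimizer, P_np, Q_np):
    target = torch.tensor(ground_truth_w1(P_np, Q_np), dtype=torch.float32)
    P = torch.tensor(P_np, dtype=torch.float32)
    Q = torch.tensor(Q_np, dtype=torch.float32)
    wP = torch.full((len(P),), 1.0 / len(P))
    wQ = torch.full((len(Q),), 1.0 / len(Q))
    pred = model((P, wP), (Q, wQ))         # hypothetical ProductNetSketch signature
    loss = (pred - target) ** 2            # squared error for this pair
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```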
D.3 Approximating 2-Wasserstein distance
To further demonstrate the use of our model, we additionally approximate the 2-Wasserstein distance; see Table 7 for results. The experimental set-up is the same as for the 1-Wasserstein distance, and we largely see the same trends; that is, N_ProductNet outperforms all other neural network implementations. Note that Table 7 shows that the Sinkhorn approximation with ϵ = 0.01 is more accurate than N_ProductNet. However, as the ϵ parameter of the Sinkhorn approximation decreases, the computation time increases. In Table 8, we show that the Sinkhorn approximation is already much slower than N_ProductNet at ϵ = 0.1 while also being less accurate. The Sinkhorn approximation with ϵ = 0.01 (reported in Table 7) is slower than the Sinkhorn approximation with ϵ = 0.1 and, additionally, is much slower than N_ProductNet at inference time.
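To illustrate the accuracy/time trade-off referred to here, the following is a small self-contained sketch using POT's entropic solver `ot.sinkhorn2` on random data. It is not our benchmarking code (in particular, the custom Sinkhorn implementation used for the larger datasets differs), and the point sets and sizes are arbitrary.

```python
# Illustrative sketch of the Sinkhorn accuracy/time tradeoff for 2-Wasserstein.
import time
import numpy as np
import ot

def sinkhorn_w2(X, Y, eps, max_iter=1000):
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    M = ot.dist(X, Y)                      # squared Euclidean ground costs for W_2
    cost = ot.sinkhorn2(a, b, M, reg=eps, numItermax=max_iter)
    return np.sqrt(cost)

X, Y = np.random.rand(300, 3), np.random.rand(300, 3)
for eps in (0.1, 0.01):
    t0 = time.time()
    w2 = sinkhorn_w2(X, Y, eps)
    print(f"eps={eps}: W_2 ~ {w2:.4f}, time {time.time() - t0:.3f}s")
```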
D.4 Generalization to large point sets
In addition to the results reported in Tables 1 and 7, which record the average relative error for point sets of sizes unseen during training, we have also included several plots in Figures 3 and 2 which demonstrate how the error of each approximation method changes as the input point set size increases, for both the 1-Wasserstein and 2-Wasserstein distances. Observe that for ModelNet-small, noisy-sphere-3, and noisy-sphere-6, the error for N_ProductNet increases significantly more slowly than that of WPCE. This trend is less evident for the uniform and ModelNet-large datasets. It is worth mentioning that we trained the models on fixed-size input for both of these datasets. It is possible that training with fixed-size input leads to a rapid deterioration in approximation quality for WPCE and
Table 5: Time for 500 1-Wasserstein distance computations in seconds. Note that we chose the input point set size to be the maximum point set size that we trained on. Additionally, the Sinkhorn distance reported uses ϵ = 0.1 as the regularization parameter. Note that as ϵ decreases, the error incurred by the Sinkhorn approximation will decrease but the computation time will also increase. Here, at ϵ = 0.1, the Sinkhorn approximation is already much slower than the neural network approximations while being less accurate.

Dataset          Input size   N_ProductNet   WPCE    N_SDeepSets   Sinkhorn   Ground truth
noisy-sphere-3   300          1.050          0.676   0.4904        2.591      2.813
noisy-sphere-6   300          0.752          0.491   0.503         1.986      6.6770
uniform          256          0.155          0.184   0.137         15.113     1.018
ModelNet-small   200          0.330          0.330   0.191         2.074      1.615
ModelNet-large   2048         1.174          1.571   0.612         239.448    254.947
RNAseq           200          1.128          0.856   0.792         92.153     105.908
N_SDeepSets when dealing with point sets of sizes not seen at training time. Furthermore, consider that WPCE may be especially sensitive to differences in input sizes at testing time, as training WPCE depends on minimizing the Wasserstein distance between the input point set and a fixed-size decoded point set, which may cause the model to be overly specialized to point sets of a fixed input size. This could explain the observed plots in both the ModelNet-large and uniform cases. Finally, as predicted by our theoretical analysis, the performance of the model degrades for higher-dimensional datasets, i.e., the RNAseq dataset.
E Extra proofs
E.1 Proof of DeepSets Universality.
Here we will provide extra details on the proof of Theorem 2.1 using multisymmetric polynomials. Note that multisymmetric polynomials were previously used in (Segol and Lipman, 2019) and (Maron et al., 2019) to show universality for equivariant set networks and arbitrary G-invariant neural networks for any permutation group G.
Proof. To begin, we will first define multisymmetric polynomials and power sum multisymmetric polynomials. Let A[y_1, . . . , y_n] be the ring of polynomials in n variables with coefficients in a ring A.

Definition E.1 (Multisymmetric polynomials). The multisymmetric polynomials in the n families of k variables x_1, . . . , x_n, where x_i = (x_{i,1}, . . . , x_{i,k}), are those polynomials that remain unchanged under every permutation of the n families x_1, . . . , x_n.

Let A be a ring. The algebra of multisymmetric polynomials in n families of k variables with coefficients in A is denoted J^k_n(A).
(a) ModelNet-small  (b) noisy-sphere-3  (c) noisy-sphere-6  (d) uniform  (e) ModelNet-large  (f) RNAseq
Figure 2: Average error of the 1-Wasserstein approximation for each model as the maximum number of points increases. Note that the Sinkhorn approximation uses ϵ = 0.1. These plots include point set sizes not seen at training time, to show how each approximation performs on unseen examples. Note that, especially for ModelNet-small, noisy-sphere-3, and noisy-sphere-6, the error for N_ProductNet increases at a slower rate than WPCE.
Table 6: Comparison of the mean relative error versus overall computation time for 500 approximations of 1-Wasserstein distance for N_ProductNet and the Sinkhorn distance. Note that the input point set size is the same as in Table 3 for each dataset. The parameter ϵ controls the accuracy of the Sinkhorn approximation, with lower ϵ corresponding to a more accurate approximation once the Sinkhorn algorithm converges. However, notice that in some cases the Sinkhorn algorithm with ϵ = 0.01 has a higher relative error than the Sinkhorn algorithm with ϵ = 0.1, as the algorithm fails to converge within a reasonable number of iterations (1000).

Dataset                      N_ProductNet    Sinkhorn (ϵ = 0.10)   Sinkhorn (ϵ = 0.01)
ModelNet-small   Error       0.084 ± 0.077   0.187 ± 0.232         0.011 ± 0.003
                 Time (s)    0.330           2.074                 104.712
ModelNet-large   Error       0.140 ± 0.206   0.148 ± 0.048         0.026 ± 0.008
                 Time (s)    1.174           239.448               1930.852
Uniform          Error       0.097 ± 0.073   0.073 ± 0.009         0.023 ± 0.098
                 Time (s)    0.155           15.113                63.028
noisy-sphere-3   Error       0.046 ± 0.043   0.187 ± 0.232         0.162 ± 0.132
                 Time (s)    1.050           2.591                 214.185
noisy-sphere-6   Error       0.015 ± 0.014   0.137 ± 0.122         0.326 ± 0.135
                 Time (s)    0.752           1.986                 101.763
RNAseq           Error       0.012 ± 0.010   0.040 ± 0.009         0.035 ± 0.013
                 Time (s)    1.128           92.153                91.573
(a) ModelNet-small  (b) noisy-sphere-3  (c) noisy-sphere-6  (d) uniform  (e) ModelNet-large  (f) RNAseq
Figure 3: Average 2-Wasserstein error for each model as the maximum number of points increases. Note that for all datasets, the Sinkhorn error is with ϵ = 0.10.
Table 7: Mean relative error between approximations and the 2-Wasserstein distance between point sets. The top row for each dataset shows the error for point sets with input sizes that were seen at training time, while the bottom row shows the error for point sets with input sizes that were not seen at training time. Note that the Sinkhorn approximation is computed with the regularization parameter set to ϵ = 0.01.

Dataset          Input size     N_ProductNet    WPCE            N_SDeepSets       Sinkhorn
noisy-sphere-3   [100, 300]     0.054 ± 0.071   0.291 ± 0.201   0.400 ± 0.336     0.078 ± 0.186
                 [400, 600]     0.188 ± 0.197   0.387 ± 0.386   0.427 ± 0.375     0.161 ± 0.311
noisy-sphere-6   [100, 300]     0.024 ± 0.010   0.331 ± 0.237   0.358 ± 0.231     0.019 ± 0.057
                 [400, 600]     0.092 ± 0.074   0.434 ± 0.598   0.623 ± 0.596     0.050 ± 0.039
uniform          256            0.112 ± 0.082   0.221 ± 0.162   0.241 ± 0.171     0.182 ± 0.044
                 [200, 300]     0.175 ± 0.123   2.431 ± 2.162   4.058 ± 3.324     0.055 ± 0.053
ModelNet-small   [20, 200]      0.078 ± 0.095   0.178 ± 0.148   0.183 ± 0.148     0.023 ± 0.059
                 [400, 600]     0.163 ± 0.151   0.216 ± 0.252   0.227 ± 0.179     0.034 ± 0.031
ModelNet-large   2048           0.187 ± 0.335   0.281 ± 0.203   0.538 ± 0.298     0.172 ± 0.065
                 [1800, 2000]   0.185 ± 0.302   0.523 ± 0.526   33.086 ± 28.481   0.046 ± 0.039
RNAseq           [20, 200]      0.049 ± 0.029   0.508 ± 0.291   0.490 ± 0.271     0.024 ± 0.009
                 [400, 600]     0.281 ± 0.057   0.533 ± 0.300   0.568 ± 0.317     0.987 ± 0.0002
Furthermore, we define the multisymmetric power-sum polynomials:

Definition E.2 (Multisymmetric power-sum polynomials). Let α = (α_1, . . . , α_k) ∈ N^k. Given x = (x_1, . . . , x_k), let
\[
x^\alpha = x_1^{\alpha_1} \cdots x_k^{\alpha_k}.
\]
The multisymmetric power sum with multi-degree α is
\[
p_\alpha = \sum_{i=1}^{n} x_i^\alpha .
\]
Among them, we will consider the set of elementary multisymmetric power sums to be those with |α| ≤ n.

Notice that there are $t = \binom{n+k}{k}$ elementary multisymmetric power sums. Let α_1, . . . , α_t be the list of all α ∈ N^k such that |α| ≤ n.

Theorem E.3 ((Briand, 2004)). Let A be a ring in which n! is invertible. Then the elementary multisymmetric power sums generate J^k_n(A) as an A-algebra.
If we take A = R, we get that the multisymmetric power sum polynomials generate J^k_N. Now, given a continuous function f : R^{k×N} → R which is invariant to permutations of the columns, we know that f can be approximated by a polynomial p which is invariant to permutations of the columns (see (Maron et al., 2019) for a detailed argument). Such a polynomial p is a multisymmetric polynomial in N families of k variables with coefficients in R, i.e., p ∈ J^k_N. Given x ∈ R^k, define
\[
\phi(x) = [x^{\alpha_1}, \ldots, x^{\alpha_t}].
\]
Table 8: Comparison of the mean relative error versus overall computation time for 300 approximations of 2-Wasserstein distance for N_ProductNet and the Sinkhorn distance. The parameter ϵ controls the accuracy of the Sinkhorn approximation, with lower ϵ corresponding to a more accurate approximation.

Dataset                      N_ProductNet    Sinkhorn (ϵ = 0.10)   Sinkhorn (ϵ = 0.01)
ModelNet-small   Error       0.078 ± 0.097   0.232 ± 0.132         0.019 ± 0.057
                 Time (s)    0.208           1.165                 9.857
ModelNet-large   Error       0.187 ± 0.335   0.363 ± 0.255         0.172 ± 0.089
                 Time (s)    2.841           6.079                 36.265
uniform          Error       0.112 ± 0.082   0.0303 ± 0.022        0.182 ± 0.044
                 Time (s)    0.712           16.515                29.312
noisy-sphere-3   Error       0.054 ± 0.071   0.225 ± 0.093         0.078 ± 0.186
                 Time (s)    0.677           1.591                 12.760
noisy-sphere-6   Error       0.024 ± 0.010   0.324 ± 0.316         0.023 ± 0.059
                 Time (s)    0.428           0.877                 9.831
RNAseq           Error       0.049 ± 0.029   0.031 ± 0.014         0.024 ± 0.009
                 Time (s)    0.716           47.701                81.016
Then
\[
\sum_{i=1}^{N} \phi(x_i) =
\begin{pmatrix} \sum_{i=1}^{N} x_i^{\alpha_1} \\ \sum_{i=1}^{N} x_i^{\alpha_2} \\ \vdots \\ \sum_{i=1}^{N} x_i^{\alpha_t} \end{pmatrix}
= \begin{pmatrix} p_{\alpha_1} \\ p_{\alpha_2} \\ \vdots \\ p_{\alpha_t} \end{pmatrix}.
\]
By Theorem E.3, the power sums p_{α_1}, . . . , p_{α_t} generate any polynomial in J^k_N. Then there is some polynomial q ∈ R[y_1, . . . , y_t] such that p = q(p_{α_1}, . . . , p_{α_t}). Setting ρ = q, we obtain p(x_1, . . . , x_N) = ρ(∑_{i=1}^N ϕ(x_i)), which approximates f, as desired.
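As a concrete illustration of this construction, the following short sketch (our own, with toy sizes n and k) enumerates all multi-indices α with |α| ≤ n, forms ϕ(x) = [x^{α_1}, . . . , x^{α_t}], and checks that there are exactly t = C(n+k, k) elementary multisymmetric power sums.

```python
# Illustrative sketch of the embedding phi and the power sums p_alpha = sum_i x_i^alpha.
from itertools import product
from math import comb
import numpy as np

def multi_indices(k, n):
    # all alpha in N^k with alpha_1 + ... + alpha_k <= n
    return [a for a in product(range(n + 1), repeat=k) if sum(a) <= n]

def phi(x, alphas):
    # x: a single point in R^k; returns [x^alpha_1, ..., x^alpha_t]
    return np.array([np.prod(x ** np.array(a)) for a in alphas])

n, k = 3, 2                                 # n points in R^k (toy sizes)
alphas = multi_indices(k, n)
assert len(alphas) == comb(n + k, k)        # t = C(n+k, k) power sums

X = np.random.rand(n, k)
power_sums = sum(phi(x, alphas) for x in X) # equals (p_alpha_1, ..., p_alpha_t)
```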
E.2 Proofs from Section 3
E.2.1 Proof of Lemma 3.2
Proof. Let ϵ > 0. Since f is uniformly continuous, there is a δ > 0 such that for all (A_1, . . . , A_m), (A'_1, . . . , A'_m) ∈ X_1 × ··· × X_m with d_{X_1×···×X_m}((A_1, . . . , A_m), (A'_1, . . . , A'_m)) < δ, we have |f(A_1, . . . , A_m) − f(A'_1, . . . , A'_m)| < ϵ. Since for any δ > 0 and any i ∈ [m], (X_i, d_{X_i}) has a (δ, a, G)-sketch, we know that there is an a_i ∈ N_+ for which there are continuous h_i : X_i → R^{a_i} and g_i : R^{a_i} → X_i with d_{X_i}(g_i ∘ h_i(A), A) < δ/m for each A ∈ X_i.

Let g′ : R^{a_1} × ··· × R^{a_m} → X_1 × ··· × X_m be defined as (u_1, . . . , u_m) ↦ (g_1(u_1), . . . , g_m(u_m)). Since d_{X_i}(g_i ∘ h_i(A_i), A_i) < δ/m for each A_i ∈ X_i and i ∈ [m],
\[
d_{X_1 \times \cdots \times X_m}\big( (g_1 \circ h_1(A_1), \ldots, g_m \circ h_m(A_m)), (A_1, \ldots, A_m) \big) < \delta.
\]
Let ρ = f ∘ g′ and ϕ_i = h_i. Then
\[
|f(A_1, \ldots, A_m) - \rho(\phi_1(A_1), \ldots, \phi_m(A_m))| = |f(A_1, \ldots, A_m) - f \circ g'(h_1(A_1), \ldots, h_m(A_m))|
= |f(A_1, \ldots, A_m) - f(g_1 \circ h_1(A_1), \ldots, g_m \circ h_m(A_m))| < \epsilon.
\]
Note that if X_1 = X_2 = ··· = X_m and G_1 = G_2 = ··· = G_m, we can use the same encoding and decoding functions, h_i and g_i respectively, for all X_i. Thus, in this case, ϕ_1 = ϕ_2 = ··· = ϕ_m.
E.2.2 Proof of Lemma 3.3
Proof. Using the same argument as in Lemma 3.2, we know that for ϵ/2 there are continuous h : X → R^a and g : R^a → X such that
\[
|f(A_1, \ldots, A_m) - f(g \circ h(A_1), \ldots, g \circ h(A_m))| < \frac{\epsilon}{2}.
\]
As before, let g′ : R^{a×m} → X × ··· × X be g′(u_1, . . . , u_m) = (g(u_1), . . . , g(u_m)). Take F : R^{a×m} → R as F(u_1, . . . , u_m) = f ∘ g′(u_1, . . . , u_m) = f(g(u_1), . . . , g(u_m)), where u_i represents the i-th column of an element of R^{a×m}. Note that F is continuous and invariant to permutations of the columns. Let $t = \binom{a+m}{m}$. Therefore, by Theorem 2.1, there are γ : R^a → R^t and ρ such that
\[
\Big| f \circ g'(v_1, \ldots, v_m) - \rho\Big( \sum_{i=1}^{m} \gamma(v_i) \Big) \Big| < \frac{\epsilon}{2}.
\]
Now set ϕ = γ ∘ h, which is a function from X to R^t. By the triangle inequality, we thus have
\[
\Big| f(A_1, \ldots, A_m) - \rho\Big( \sum_{i=1}^{m} \phi(A_i) \Big) \Big|
\le \Big| f(A_1, \ldots, A_m) - f \circ g'(h(A_1), \ldots, h(A_m)) \Big|
+ \Big| f \circ g'(h(A_1), \ldots, h(A_m)) - \rho\Big( \sum_{i=1}^{m} \phi(A_i) \Big) \Big|
< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.
\]
This completes the proof.