Skip to content

Conversation

@veluca93
Copy link
Contributor

@veluca93 veluca93 commented Jun 18, 2025

This gives a stable ID that can be used to refer to a specific submission.

IDs will be used for the submission API. We can't use submission IDs as they leak information on the total number of submissions.

Quick SQL snippet to update an existing database:

begin;
alter table submissions add column opaque_id BIGINT;
update submissions set opaque_id = id where opaque_id IS NULL;
ALTER TABLE submissions ADD CONSTRAINT participation_opaque_unique UNIQUE (participation_id, opaque_id);
commit;

@veluca93 veluca93 requested a review from gollux June 18, 2025 23:51
@veluca93 veluca93 requested a review from wil93 as a code owner June 18, 2025 23:51
@wil93
Copy link
Member

wil93 commented Jun 19, 2025

What is an opaque_id? Is it to avoid disclosing the submission.id to users?

@veluca93
Copy link
Contributor Author

What is an opaque_id? Is it to avoid disclosing the submission.id to users?

yes

@wil93
Copy link
Member

wil93 commented Jun 19, 2025

What is an opaque_id really? And why we need it? It would be good to update the PR's description.

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

We need submission identifiers for the API (PR #1335). But using database IDs assigned serially would leak information on the number of submissions by other contestants.

@veluca93
Copy link
Contributor Author

What is an opaque_id really? And why we need it? It would be good to update the PR's description.

I believe the PR description already explains:

This gives a stable ID that can be used to refer to a specific submission.

@wil93
Copy link
Member

wil93 commented Jun 19, 2025

@veluca93 not really (since the existing ID would also work, arguably).

If you can add Martin's sentence We need submission identifiers for the API (PR https://github.com/cms-dev/cms/pull/1335). But using database IDs assigned serially would leak information on the number of submissions by other contestants. to the PR's description it would be much better.

@veluca93
Copy link
Contributor Author

@veluca93 not really, if you can add Martin's sentence We need submission identifiers for the API (PR https://github.com/cms-dev/cms/pull/1335). But using database IDs assigned serially would leak information on the number of submissions by other contestants. to the PR's description it would be much better.

Not sure it adds that much info, but sure, done.
Also note that you can modify PR descriptions.

@prandla
Copy link
Member

prandla commented Jun 19, 2025

i'm not sure i like this approach much, because of the possibility of collisions. even though it would take around 2^31 submissions in the DB before collisions become likely enough to be a problem, it still seems a bit unclean to me. (also, if there is collision-avoidance in the generate_opaque_id function, then we should also implement similar logic in the dumpupdater, to at least ensure the IDs don't collide within a dump.)

if we want to make collisions so unlikely as to make them really negligible, i think it'd be better to use UUIDs. alternatively we can use an encrypted form of the database ID (encrypted with e.g. the secret_key in cms.conf).

to be clear i'm not really opposed to the current implementation, it's probably easier than my proposed alternatives. (well, uuids actually shouldn't be more difficult.)

@veluca93
Copy link
Contributor Author

i'm not sure i like this approach much, because of the possibility of collisions. even though it would take around 2^31 submissions in the DB before collisions become likely enough to be a problem, it still seems a bit unclean to me. (also, if there is collision-avoidance in the generate_opaque_id function, then we should also implement similar logic in the dumpupdater, to at least ensure the IDs don't collide within a dump.)

if we want to make collisions so unlikely as to make them really negligible, i think it'd be better to use UUIDs. alternatively we can use an encrypted form of the database ID (encrypted with e.g. the secret_key in cms.conf).

to be clear i'm not really opposed to the current implementation, it's probably easier than my proposed alternatives. (well, uuids actually shouldn't be more difficult.)

IMO if IDs collide in a dump you can just retry, that's unlikely enough.

During normal operations collisions are indeed avoided.

@gollux mentioned that UUID had some issues in postgres, but I don't remember exactly what.

I am opposed to using a function of the submission ID, because it introduces the possibility of mistakes leaking information.

Copy link
Contributor

@gollux gollux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add more notes as separate comments.

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

My original proposal was to have the opaque IDs scoped to the given participation, while you have them globally unique.

I still think that a more restricted scope (at least a contest if not a participation) is better. One reason is that it reduces the probability of collisions, the other that it allows preserving the IDs in contest dumps without fear of collisions when importing. Also, when a dump is upgraded from an older format, it is sufficient to avoid collisions in newly assigned IDs.

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

Regarding ID format: UUIDs are popular, but quite unwieldy for both people and machines. In particular, they make poor index fields - they consume lots of space and they are allocated randomly, so a B-tree based index grows at all places simultaneously, as opposed to local growth when using sequential IDs. The latter problem is solved by UUIDv7, but it is still impractically large.

Using 64-bit IDs saves space (and makes the IDs more human-friendly, especially if we decide to encode them in a base larger than 10), but if we allocate them at random, they suffer from the same performance issues in B-trees. Choosing 64-bit IDs whose upper 32-bits are a UNIX timestamp and the lower bits random would make index performance better. (If the IDs are scoped globally or to a contest; scoping per participation is more sensitive, you need to make the index on ID first and participation second.)

Still, it's a question of how much do we care: with 400 contestants, 3 tasks in a contest, and 30 submissions per task, it's just 36000 submissions in total and the index is going to be tiny. So even fully random IDs scoped to a contest are going to perform quite reasonably.)

@wil93
Copy link
Member

wil93 commented Jun 19, 2025

Just to cover all bases: how bad is it really to leak information on the total number of submissions?

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

Just to cover all bases: how bad is it really to leak information on the total number of submissions?

That's not a healthy approach to security :) Contestants are supposed to be separated, so all leaks are bad to some degree.

Of course there are some leaks that are almost unavoidable. One such example is evaluation delays. We can think about them and argue that they give no real advantage to contestants.

But when the number of leaks grows, you must argue that even a combination of all the leaks gives no advantage, which is fiendishly hard. So whenever a leak is easy to stop, we want to do it, so that the few remaining unavoidable leaks will be manageable.

@wil93
Copy link
Member

wil93 commented Jun 19, 2025

What I mean is: imagine that CMS included a random piece of stats in the user-visible page (let's say right next to the countdown in the page) which read "Total submissions: 1234" and you maybe even see the number grow in real time. It's not that crazy to think, there are a lot of assumptions that early versions of CMS made. (In one IOI, I remember we even experimented with showing a breakdown on where the points where being distributed, so that weaker contestants would know on which task to focus their attention on).

Now, in this scenario, would we be worried about this already-existing leak and would we be trying to remove this information? Or would we keep it as it is since it's there and it doesn't actually hurt?

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

In that case, I would want to make the visibility of such stats configurable.

@wil93
Copy link
Member

wil93 commented Jun 19, 2025

OK so you think it gives some kind of advantage to be able to see that information? That's fair.

As for this point made by @veluca93:

I am opposed to using a function of the submission ID, because it introduces the possibility of mistakes leaking information.

What kind of mistakes do you mean? If I understood correctly, the "function of the submission ID" approach would re-use the secret key configured in cms.toml to get some reversible hash, for which there is probably some standard function out there (likely already among our dependencies).

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

OK so you think it gives some kind of advantage to be able to see that information? That's fair.

I am not sure that it gives. But I am sure that contestants are much more ingenious than me, so I prefer minimizing leaks to quantifying advantages obtainable from the leaks :)

@gollux
Copy link
Contributor

gollux commented Jun 19, 2025

What kind of mistakes do you mean? If I understood correctly, the "function of the submission ID" approach would re-use the secret key configured in cms.toml to get some reversible hash, for which there is probably some standard function out there (likely already among our dependencies).

I think that a stronger argument against basing public IDs on internal sequential IDs is that you lose the possibility of dump imports preserving original public IDs.

@veluca93 veluca93 force-pushed the submission-ids branch 2 times, most recently from 7000118 to bf03a00 Compare June 19, 2025 20:40
@veluca93
Copy link
Contributor Author

My original proposal was to have the opaque IDs scoped to the given participation, while you have them globally unique.

I still think that a more restricted scope (at least a contest if not a participation) is better. One reason is that it reduces the probability of collisions, the other that it allows preserving the IDs in contest dumps without fear of collisions when importing. Also, when a dump is upgraded from an older format, it is sufficient to avoid collisions in newly assigned IDs.

Switched this to participation-scope, PTAL.

This gives a stable ID that can be used to refer to a specific
submission.
@veluca93 veluca93 merged commit 7d7eb3c into cms-dev:main Jun 21, 2025
3 checks passed
@wil93 wil93 mentioned this pull request Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants