-
Notifications
You must be signed in to change notification settings - Fork 400
Add a "opaque_id" column to submissions. #1419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
What is an opaque_id? Is it to avoid disclosing the submission.id to users? |
yes |
|
What is an opaque_id really? And why we need it? It would be good to update the PR's description. |
|
We need submission identifiers for the API (PR #1335). But using database IDs assigned serially would leak information on the number of submissions by other contestants. |
I believe the PR description already explains:
|
|
@veluca93 not really (since the existing ID would also work, arguably). If you can add Martin's sentence |
Not sure it adds that much info, but sure, done. |
|
i'm not sure i like this approach much, because of the possibility of collisions. even though it would take around 2^31 submissions in the DB before collisions become likely enough to be a problem, it still seems a bit unclean to me. (also, if there is collision-avoidance in the generate_opaque_id function, then we should also implement similar logic in the dumpupdater, to at least ensure the IDs don't collide within a dump.) if we want to make collisions so unlikely as to make them really negligible, i think it'd be better to use UUIDs. alternatively we can use an encrypted form of the database ID (encrypted with e.g. the secret_key in cms.conf). to be clear i'm not really opposed to the current implementation, it's probably easier than my proposed alternatives. (well, uuids actually shouldn't be more difficult.) |
IMO if IDs collide in a dump you can just retry, that's unlikely enough. During normal operations collisions are indeed avoided. @gollux mentioned that UUID had some issues in postgres, but I don't remember exactly what. I am opposed to using a function of the submission ID, because it introduces the possibility of mistakes leaking information. |
gollux
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add more notes as separate comments.
|
My original proposal was to have the opaque IDs scoped to the given participation, while you have them globally unique. I still think that a more restricted scope (at least a contest if not a participation) is better. One reason is that it reduces the probability of collisions, the other that it allows preserving the IDs in contest dumps without fear of collisions when importing. Also, when a dump is upgraded from an older format, it is sufficient to avoid collisions in newly assigned IDs. |
|
Regarding ID format: UUIDs are popular, but quite unwieldy for both people and machines. In particular, they make poor index fields - they consume lots of space and they are allocated randomly, so a B-tree based index grows at all places simultaneously, as opposed to local growth when using sequential IDs. The latter problem is solved by UUIDv7, but it is still impractically large. Using 64-bit IDs saves space (and makes the IDs more human-friendly, especially if we decide to encode them in a base larger than 10), but if we allocate them at random, they suffer from the same performance issues in B-trees. Choosing 64-bit IDs whose upper 32-bits are a UNIX timestamp and the lower bits random would make index performance better. (If the IDs are scoped globally or to a contest; scoping per participation is more sensitive, you need to make the index on ID first and participation second.) Still, it's a question of how much do we care: with 400 contestants, 3 tasks in a contest, and 30 submissions per task, it's just 36000 submissions in total and the index is going to be tiny. So even fully random IDs scoped to a contest are going to perform quite reasonably.) |
|
Just to cover all bases: how bad is it really to leak information on the total number of submissions? |
That's not a healthy approach to security :) Contestants are supposed to be separated, so all leaks are bad to some degree. Of course there are some leaks that are almost unavoidable. One such example is evaluation delays. We can think about them and argue that they give no real advantage to contestants. But when the number of leaks grows, you must argue that even a combination of all the leaks gives no advantage, which is fiendishly hard. So whenever a leak is easy to stop, we want to do it, so that the few remaining unavoidable leaks will be manageable. |
|
What I mean is: imagine that CMS included a random piece of stats in the user-visible page (let's say right next to the countdown in the page) which read "Total submissions: 1234" and you maybe even see the number grow in real time. It's not that crazy to think, there are a lot of assumptions that early versions of CMS made. (In one IOI, I remember we even experimented with showing a breakdown on where the points where being distributed, so that weaker contestants would know on which task to focus their attention on). Now, in this scenario, would we be worried about this already-existing leak and would we be trying to remove this information? Or would we keep it as it is since it's there and it doesn't actually hurt? |
|
In that case, I would want to make the visibility of such stats configurable. |
|
OK so you think it gives some kind of advantage to be able to see that information? That's fair. As for this point made by @veluca93:
What kind of mistakes do you mean? If I understood correctly, the "function of the submission ID" approach would re-use the secret key configured in cms.toml to get some reversible hash, for which there is probably some standard function out there (likely already among our dependencies). |
I am not sure that it gives. But I am sure that contestants are much more ingenious than me, so I prefer minimizing leaks to quantifying advantages obtainable from the leaks :) |
I think that a stronger argument against basing public IDs on internal sequential IDs is that you lose the possibility of dump imports preserving original public IDs. |
7000118 to
bf03a00
Compare
Switched this to participation-scope, PTAL. |
This gives a stable ID that can be used to refer to a specific submission.
This gives a stable ID that can be used to refer to a specific submission.
IDs will be used for the submission API. We can't use submission IDs as they leak information on the total number of submissions.
Quick SQL snippet to update an existing database: