SQR-119
Metrics-driven user restriction for the RSP#
Abstract
Some users of the Rubin Science Platform may use the platform in ways that do not exceed our API rate limits but that consume enough cumulative resources to interfere with other users. One example is bulk download of all Rubin data via RSP APIs. This technote proposes a design for a system that detects such usage patterns and reacts by throttling the user’s use of resources.
Problem statement#
Bandwidth from the storage location of some types of Rubin data to the Rubin Science Platform or to external users is a limited resource that must be managed. A few overeager users could potentially exhaust the available bandwidth and cause performance degradation for other users who are using the platform as intended. However, in this case, each individual request by an overeager user is legitimate. The problem is caused by the volume.
Currently, the Rubin Science Platform does some simple rate limiting of API calls to protect the services. This is only a gross, low-level limit on the number of HTTP requests that can be made to a service and cannot take into account the action the user is performing. It is feasible for a user to make requests well below the rate limits that nonetheless represent a disproportionate and unsupportable use of resources.
One concrete example is bulk download of data. The Rubin Science Platform APIs are intended for interactive access to Rubin data, but the same APIs can feasibly be used to download large amounts of Rubin’s scientific data. This, however, could quickly exhaust the available bandwidth for serving data and thus impact other users. (The intended long-term solution to this problem is a bulk data download service, which had not yet been written when this technote was written.)
We therefore need to detect users who are placing excessive demands on the system in a way that cannot be detected by simple rate limiting and then take some remedial action to ask them to stop and, if necessary, reduce their usage to acceptable levels.
Initial target#
We will start by solving the following concrete problem:
Users may attempt to use the Rubin Science Platform APIs to download large numbers of Rubin images and thus threaten to exhaust the available bandwidth to the data storage. The RSP should automatically detect such attempts by adding up the size of data that the user has retrieved and alerting the user to reduce their usage once they pass some threshold. If that alert does not change their behavior, the user’s use of the relevant APIs should be throttled sufficiently to ensure that their continued use will not impact other users of the platform.
This is a very specific example of what we believe will be a larger set of similar problems, so the system should be designed for future generalization.
This proposal assumes the existence of a per-user notification system as discussed in SQR-118.
Proposed design#
Record keeping#
Ideally, we would count and log the bytes that the user downloaded. Unfortunately, this is difficult to do in the current Rubin Science Platform architecture, since RSP API services provide the user with a signed URL and the user then downloads the resulting file directly from the underlying storage (or has some other component, such as the Portal Aspect, do that for them). At present, that storage is provided by Weka at SLAC and we do not have easy visibility into the size of files the user is downloading or a way to tie a specific file access to a user.
For the initial implementation, we will take the compromise approach of assuming that whenever we provide the user with a signed URL, the user will download the file that URL points to. This will overcount user usage, since users may decline to download some resources or may download only byte ranges of some files. We are not sure how significant that overcount will be. We will start collecting that data and then manually check it to see if it seems reasonable and is usable as an approximation.
There are three primary places where we make signed URLs to data of significant size available to the user: the SIAv2 service (SQR-095), the DataLink service (DMTN-238) that provides image access URLs associated with TAP queries, and the Butler server (DMTN-242). In addition, the cutout service (DMTN-208) has to download the original images from which it is making cutouts, and may be in scope if we find that heavy use of that service creates significant bandwidth demands.
Currently, none of those services record the size of the data behind the signed URLs that they provide to the user. However, the Butler database does have that information (at least for most data files). The first step, therefore, is to obtain those file sizes from the Butler server and record that information for each user request. The initial target services for this will be the SIAv2 and DataLink services, since we think they are the most likely services that someone trying to download large quantities of images would use.
The SIAv2 service already logs metrics via Kafka and only needs to obtain the file size and include that in its metrics. The DataLink service currently does not record metrics; this functionality will need to be added.
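As a rough illustration of the extra information each service would record, the following sketch builds one event per signed URL issued. The `SignedUrlEvent` class, its field names, and the `build_event` helper are hypothetical and do not reflect the real metrics publishing API or event schema used by the SIAv2 service.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SignedUrlEvent:
    """Hypothetical event: one signed URL handed to one user."""

    username: str
    service: str  # assumed labels such as "siav2" or "datalink"
    size_bytes: Optional[int]  # None if the Butler database has no size


def build_event(username: str, service: str, size_bytes: Optional[int]) -> dict:
    """Validate the inputs and return the event as a plain dictionary."""
    if size_bytes is not None and size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return asdict(SignedUrlEvent(username, service, size_bytes))
```

Allowing a missing size reflects the caveat above that the Butler database has file sizes for most, but possibly not all, data files.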
Detection#
Once the file sizes are recorded in application metrics, that information will be available in the InfluxDB database underlying the Sasquatch metrics system used by the Rubin Science Platform. We can then use InfluxDB queries to find users who are exceeding reasonable thresholds on the quantity of data that they are (potentially) downloading.
For the initial proof of concept, we will see if we can write an InfluxDB query that returns the total bytes potentially downloaded by each user across some time range. We can then check the byte total against a configurable threshold and build a list of users who are potentially consuming more than their fair share of bandwidth.
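The check described above might look something like the following sketch. The InfluxQL string is only an example of the kind of query we would try: the measurement name `signed_url`, the field `size_bytes`, the tag `username`, and the 24-hour window are all assumptions, as is the idea that the username is stored as a tag. The two functions show the same aggregation and threshold comparison in plain Python over `(username, size_bytes)` rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Example only. Measurement, field, and tag names are assumptions.
EXAMPLE_QUERY = (
    'SELECT SUM("size_bytes") FROM "signed_url" '
    'WHERE time > now() - 24h GROUP BY "username"'
)


def total_bytes_by_user(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the potentially downloaded bytes for each user."""
    totals: Dict[str, int] = defaultdict(int)
    for username, size_bytes in rows:
        totals[username] += size_bytes
    return dict(totals)


def users_over_threshold(totals: Dict[str, int], threshold_bytes: int) -> List[str]:
    """Return, in sorted order, the users whose total exceeds the threshold."""
    return sorted(u for u, total in totals.items() if total > threshold_bytes)
```

Keeping the threshold as a parameter matches the requirement that it be configurable.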
Hopefully, this approach will generalize to other metrics-driven throttling cases: record the relevant metric via Sasquatch, use an InfluxDB query to find users who are exceeding some threshold, and then take action on each of them.
Response#
When we detect such a user, the first action will be to send that user a per-user notification, using the service described in SQR-118, asking them to reduce their usage and referring them to some static documentation about acceptable usage.
If the user’s usage then drops, no further action is required. If the user continues to exceed the threshold, the next step will be to set restrictive API quotas for the user on the relevant services (SIAv2 and DataLink) via the Gafaelfawr quota mechanism. This will impact more than just the user’s large data downloads since the API rate limits are, as previously mentioned, not very granular. However, it should be sufficient to get the user’s attention and to prevent disruption to the usage of other users until the user changes their usage pattern.
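The escalation described here can be read as a small state machine. The sketch below is one possible reading, not a settled design: the `Status` names, the `next_status` function, and the `grace` period between warning and restriction are hypothetical, and the length of that grace period is still an open question.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    OK = "ok"
    WARNED = "warned"
    RESTRICTED = "restricted"


def next_status(
    current: Status,
    over_threshold: bool,
    warned_at: Optional[datetime],
    now: datetime,
    grace: timedelta,
) -> Status:
    """Decide the user's next status from their current usage."""
    if not over_threshold:
        # Usage dropped: no further action, lift any restriction.
        return Status.OK
    if current is Status.OK:
        # First detection: send the per-user notification.
        return Status.WARNED
    if current is Status.WARNED and warned_at is not None:
        # Still over the threshold after the grace period: restrict quotas.
        if now - warned_at >= grace:
            return Status.RESTRICTED
    return current
```

Separating the decision from the side effects (sending the notification, setting the quota) would make the policy easy to test without contacting any other service.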
Currently, Gafaelfawr only supports static quotas and an administrative override. This approach requires granular per-user overrides managed by another automated system, and therefore probably warrants a new API and separate storage for these quota overrides in Gafaelfawr, along with slightly more sophisticated logic for merging user quota restrictions across the various input sources.
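One possible merging rule, shown purely as an illustration, is that an automated restriction can only lower a user's quota for a service and never raise it. The function below is hypothetical and is not Gafaelfawr's current or planned logic. It assumes each input source is a mapping from service name to a requests-per-period limit.

```python
from typing import Dict


def merge_api_quotas(*sources: Dict[str, int]) -> Dict[str, int]:
    """Merge quota sources, keeping the most restrictive limit per service."""
    merged: Dict[str, int] = {}
    for source in sources:
        for service, limit in source.items():
            # Keep the lowest limit seen so far for this service.
            merged[service] = min(limit, merged.get(service, limit))
    return merged
```

Under this rule, a static quota of 1000 for a service combined with an automated restriction of 10 would yield 10, while services without a restriction would keep their static quota.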
Technical details#
This service seems like a natural fit for a Kubernetes CronJob running at whatever frequency we think balances time to response with the time windows across which we’re counting user usage and the load on InfluxDB from the queries.
That service should run with its own Gafaelfawr credentials obtained from a Kubernetes GafaelfawrServiceToken custom resource so that it can make API calls on its own behalf, such as to the user notification service and to Gafaelfawr’s new quota restriction API.
This service will need its own local data store to keep track of the status of each user that crossed a threshold, including such details as when a warning was issued and when the user’s API quota was restricted. That same data store can store the notification ID of the user notification so that the notification can be withdrawn if the user falls back below the usage threshold.
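A minimal sketch of such a data store follows, using SQLite only because it keeps the example self-contained. The choice of database, the table layout, and all function names are assumptions. It stores the fields mentioned above: when the warning was issued, when the quota was restricted, and the notification ID needed to withdraw the notification.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS user_status (
    username TEXT PRIMARY KEY,
    warned_at TEXT,
    restricted_at TEXT,
    notification_id TEXT
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_warning(conn, username: str, warned_at: str, notification_id: str) -> None:
    """Record that a user was warned and which notification was sent."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO user_status (username) VALUES (?)", (username,))
        conn.execute(
            "UPDATE user_status SET warned_at = ?, notification_id = ? WHERE username = ?",
            (warned_at, notification_id, username),
        )


def record_restriction(conn, username: str, restricted_at: str) -> None:
    """Record when the user's API quota was restricted."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO user_status (username) VALUES (?)", (username,))
        conn.execute(
            "UPDATE user_status SET restricted_at = ? WHERE username = ?",
            (restricted_at, username),
        )


def clear_user(conn, username: str) -> Optional[str]:
    """Forget a user and return the notification ID to withdraw, if any."""
    row = conn.execute(
        "SELECT notification_id FROM user_status WHERE username = ?", (username,)
    ).fetchone()
    with conn:
        conn.execute("DELETE FROM user_status WHERE username = ?", (username,))
    return row[0] if row else None
```

When a user falls back below the threshold, the caller would use the ID returned by `clear_user` to withdraw the notification and would also lift any quota restriction.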
Open questions#
The initial proof of concept seeks to answer the following questions:
Will recording the size of every file for which we issue a signed URL be a sufficiently good proxy for the user’s actual downloads that we can use it as the basis for this metrics alerting and throttling system?
Can we write InfluxDB queries that efficiently detect users who have passed usage thresholds and that will scale to large numbers of users?
Will the various intermediate services handle 429 responses from services correctly?
Is throttling the chosen Gafaelfawr API quotas sufficient both to prevent immediate problems from excessive bandwidth use and to get the user’s attention?
Will the initial user notification alter user behavior? How much time should we give the user after sending the initial notification?
Plan#
Add support to the Butler API to obtain the file size of a file for which the Butler server is issuing a signed URL, if this is not already present.
Obtain that file size in the SIAv2 service and include it in the success metrics.
Add metrics support to the DataLink service, obtain that file size, and include it in the success metrics.
Check the resulting InfluxDB data to see if it is reasonable as a proxy for actual user bandwidth usage or if we will have to find a different approach.
Implement the per-user notification service described in SQR-118.
Implement the metrics analysis service that runs the InfluxDB query and sends a per-user notification to users that have exceeded some threshold.
Implement the new Gafaelfawr API for overriding a user’s API quota limits.
Add quota API overrides to the metrics analysis service for users who have continued to exceed some threshold.