Add support to Asynchronous API #320

Open
wants to merge 1 commit into
base: v4.1-dev

Conversation

ake123 (Contributor) commented May 22, 2025

Add support to Asynchronous API #319

pitkant (Member) commented May 22, 2025

Did you test this with some dataset? For me it produces nonsensical output:

Ah, I passed the original dataset ID to the asynchronous function, so of course it returned nonsensical results. It might be useful to warn the user about this kind of misuse; see the edit history of this comment for example output. I will test this more.

pitkant (Member) commented May 22, 2025

I'm sure you have already read the Eurostat documentation on this, but I'll copy-paste it here:

Example queries

1 - Query in range for asynchronous extraction

Following query would be considered within limits and processed by the system

http://ec.europa.eu/eurostat/api/comext/dissemination/sdmx/2.1/data/DS-045409/A.DK.US..1.SUPPLEMENTARY_QUANTITY?format=SDMX_2.1_STRUCTURED

This query matches the following positions:

freq -> 1 position ("A")
reporter -> 1 position ("DK")
partner -> 1 position ("US")
product -> 40321 positions (there is no filter on this dimension)
flow -> 1 position ("1")
time_period -> 36 positions (there is no explicit filter on this dimension but the system will only return yearly data)
indicators -> 1 position ("SUPPLEMENTARY_QUANTITY")
Estimated cost: 1 x 1 x 1 x 40321 x 1 x 36 x 1 = 1 451 556 which is above the synchronous limit but below the maximum extraction limit so this request is treated asynchronously.

2 - Query above range for asynchronous extraction

Following query would be considered off limits and not processed by the system

https://ec.europa.eu/eurostat/api/comext/dissemination/sdmx/2.1/data/DS-045409/A.PT...2.QUANTITY_IN_100KG?format=SDMX_2.1_STRUCTURED

This query matches the following positions:

freq -> 1 position ("A")
reporter -> 1 position ("PT")
partner -> 282 positions (there is no filter on this dimension)
product -> 40321 positions (there is no filter on this dimension)
flow -> 1 position ("2")
time_period -> 36 positions (there is no explicit filter on this dimension but the system will only return yearly data as the frequency requested is annual)
indicators -> 1 position ("QUANTITY_IN_100KG")
Estimated cost: 1 x 1 x 282 x 40321 x 1 x 36 x 1 = 409 338 792 which is above the maximum extraction limit of 5 000 000 cells and an error is returned.
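The cost arithmetic in the two examples above can be reproduced with a few lines of R. This is only a sketch: the function names `estimate_extraction_cost()` and `classify_request()` are hypothetical and not part of the eurostat package, and the value passed as `sync_limit` is a placeholder, since the quoted documentation states only the maximum extraction limit (5 000 000 cells) and not the synchronous threshold.

```r
# Estimated cost = product of the number of matched positions per dimension.
estimate_extraction_cost <- function(positions) {
  prod(as.numeric(positions))
}

# Classify a request against the two limits.
# max_limit comes from the quoted documentation (5 000 000 cells);
# sync_limit has no default because its real value is not given there.
classify_request <- function(cost, sync_limit, max_limit = 5e6) {
  if (cost > max_limit) {
    "rejected"
  } else if (cost > sync_limit) {
    "asynchronous"
  } else {
    "synchronous"
  }
}

# Positions per dimension, copied from the two example queries.
query_1 <- c(freq = 1, reporter = 1, partner = 1, product = 40321,
             flow = 1, time_period = 36, indicators = 1)
query_2 <- c(freq = 1, reporter = 1, partner = 282, product = 40321,
             flow = 1, time_period = 36, indicators = 1)

cost_1 <- estimate_extraction_cost(query_1)  # 1451556
cost_2 <- estimate_extraction_cost(query_2)  # 409338792

# ASSUMPTION: 1e6 is a placeholder for the undocumented synchronous limit.
classify_request(cost_1, sync_limit = 1e6)  # "asynchronous"
classify_request(cost_2, sync_limit = 1e6)  # "rejected"
```

A check like this could run before a request is sent, so that a query above the maximum limit fails locally instead of counting against the fair use quota.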

To stay within the fair use limits, I think we should refrain from running any automated tests that trigger the asynchronous request.

Fair use of the service

A request for data extraction will be forced to be processed asynchronously based on the evaluation of 3 main criteria:

  • the number of concurrent data extraction requests
  • the number of requests performed during a period
    • per day
    • during the last 7 days
    • during the last 30 days
  • the cumulative "extraction cost" generated during a period
    • per day
    • during the last 7 days
    • during the last 30 days

If any of the above criteria exceeds its threshold, further data extraction requests will be forced to be processed asynchronously for as long as the rule is violated.

In order to avoid this, we recommend to:

  • trigger 1 extraction request at a time
  • in case of use of scripts, don't use parallelisation
  • if applicable, get data from the bulk download

Maybe one request of this type could be recorded and used as a dummy? I don't have an immediate answer to that.

Btw, while trying to trigger the asynchronous response I tested get_eurostat_sdmx with the dataset "bop_iip6_q". Curiously, I was able to download the whole dataset with no filters, all 57,312,860 rows. I was sure it would trigger either the async response or an error, but it did not. It was very slow, though.

ake123 (Contributor, Author) commented May 22, 2025

I used queries like the one below to trigger async mode, but you can't reuse the same query: I guess the server caches it, so on a repeat request it switches to synchronous mode and reports "Synchronous mode: CSV data returned directly."

dat <- get_eurostat_sdmx(
  id = "DS-045409",
  filters = list(
    FREQ = "A",
    FLOW = "1",
    REPORTER = c("FI", "SE", "ES"),
    PARTNER = c("US"),
    INDICATORS = "SUPPLEMENTARY_QUANTITY"
  ),
  agency = "eurostat_comext",
  type = "code",
  wait = 10,
  max_wait = 600
)
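
For readers unfamiliar with asynchronous extraction, the call above suggests a polling pattern: check for the result, sleep, and give up after a timeout. The sketch below illustrates that pattern only. It assumes `wait` is the pause in seconds between checks and `max_wait` is the total time allowed, which this thread does not confirm, and `poll_until_ready()` and `check_fun` are hypothetical names rather than functions from the eurostat package.

```r
# Poll check_fun() until it returns something other than NULL.
# ASSUMPTION: wait = seconds between checks, max_wait = total seconds allowed.
poll_until_ready <- function(check_fun, wait = 10, max_wait = 600) {
  waited <- 0
  repeat {
    result <- check_fun()
    if (!is.null(result)) {
      return(result)
    }
    # Stop before sleeping if the next pause would exceed the time budget.
    if (waited + wait > max_wait) {
      stop("Timed out after ", waited, " seconds")
    }
    Sys.sleep(wait)
    waited <- waited + wait
  }
}

# Stand-in for a status check: "not ready" (NULL) twice, then a result.
# A real check would query the server; this stub makes no network request.
attempts <- 0
fake_check <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) NULL else "data ready"
}

# wait = 0 keeps the demonstration instant.
res <- poll_until_ready(fake_check, wait = 0, max_wait = 1)
```

Because the stub never touches the network, a loop like this can be exercised without adding to the request count that the fair use rules track.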

pitkant (Member) commented May 23, 2025

I'm satisfied if you can get the function working just once. We can mark it with an Experimental tag (or similar) in the man pages.
