A data science team is developing a production pipeline to retrieve customer transaction data from a third-party REST API for a near-real-time analytics dashboard. The API is subject to a strict rate limit, returns data in paginated JSON format, and supports server-side filtering via query parameters. The pipeline must run continuously to fetch the latest data. Which of the following data retrieval strategies is the most robust and scalable for this production environment?
Implement an iterative process to handle pagination, using the pointers provided in the API's response. Before each request, check for and adhere to rate-limit information (e.g., X-RateLimit-Remaining). Employ server-side filtering to request only data updated since the last successful retrieval.
Utilize a high-concurrency approach with multiple parallel threads to request all data pages simultaneously, reducing ingestion time. Store the API key in a configuration file within the project repository for access by all threads.
Make a single API call to the base endpoint, requesting all available data. Cache the entire response, and then perform filtering and transformation on the client-side to identify new transactions.
Configure a simple loop to make repeated requests until a 429 (Too Many Requests) error is received. Upon receiving a 429 error, pause the script for a fixed duration of 60 seconds before resuming requests.
The correct option describes a robust and scalable strategy because it addresses all three constraints. It handles pagination properly by following the pointers the API provides in each response. It respects rate limits proactively by checking headers such as X-RateLimit-Remaining before issuing the next request, which is more efficient than reacting only after an error. And it uses server-side filtering to request only records updated since the last successful retrieval, which minimizes data transfer and client-side processing.

The other options each violate at least one constraint. Requesting all data in a single call and filtering on the client side is inefficient and does not scale as transaction volume grows. Firing high-concurrency parallel requests without regard for rate limits will get the client throttled or blocked, and storing the API key in a configuration file committed to the project repository is a security risk on top of that. A purely reactive approach, where the script pauses for a fixed 60 seconds only after receiving a 429 Too Many Requests error, wastes quota and is less reliable than proactively honoring the rate-limit headers.
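A minimal Python sketch of the correct strategy follows. The endpoint URL, the `updated_since` query parameter, the `next` pagination pointer in the response body, the `X-RateLimit-Reset` header, and the environment-variable name are all assumptions chosen for illustration; a real API's pagination and rate-limit contract may differ.

```python
# Sketch only: endpoint, query parameter, response shape, and header
# names are hypothetical and must be adapted to the actual API.
import os
import time

import requests

BASE_URL = "https://api.example.com/transactions"  # hypothetical endpoint
API_KEY = os.environ["TRANSACTIONS_API_KEY"]  # from env/secrets manager, never the repo


def fetch_new_transactions(last_sync_iso: str) -> list[dict]:
    """Fetch every page of transactions updated since last_sync_iso."""
    records = []
    url = BASE_URL
    # Server-side filtering: request only rows changed since the last run.
    params = {"updated_since": last_sync_iso}
    headers = {"Authorization": f"Bearer {API_KEY}"}

    while url:
        resp = requests.get(url, params=params, headers=headers, timeout=30)
        resp.raise_for_status()

        # Proactive rate limiting: if the quota is exhausted, sleep until
        # the window resets rather than waiting to be throttled with a 429.
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
        if remaining == 0:
            reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset_at - time.time(), 0))

        payload = resp.json()
        records.extend(payload["data"])

        # Pagination: follow the pointer the API provides. The filter is
        # already encoded in the "next" URL, so drop the extra params.
        url = payload.get("next")
        params = None

    return records
```

Because rate-limit headers arrive on each response, the sketch inspects them after every call and sleeps before issuing the next request when the quota is exhausted, which is the practical form of "checking before each request." The important design choice is that both the filter and the pause happen before the next request is sent, so the pipeline never has to discard data or recover from a 429.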