A fintech startup is developing a machine-learning model to predict credit-card default probability. Historical portfolio data contain only 0.08% defaults, and regulators prohibit the company from transmitting any customer PII outside its private cloud. Senior management asks the data-science lead to obtain additional data within seven days, without recurring licensing costs, and in a form that lets the team publish aggregate model-validation results in a peer-reviewed journal. Which data-acquisition strategy best satisfies all of these constraints?
Conduct a randomized micro-lending experiment issuing short-term loans to new customers and tracking real-world repayment outcomes.
Field an online survey asking consumers to self-report their borrowing and default behavior.
Generate synthetic loan-level data that statistically mimic the existing portfolio while oversampling rare default events.
Purchase an individual-level credit-bureau dataset containing historical defaults under a multi-year Fair Credit Reporting Act licensing agreement.
Generating synthetic loan application and repayment records lets the team oversample rare default events, control class balance, and avoid exposing any real PII. The data can be produced inside the private cloud within days, carry no ongoing fees, and support publishing aggregate validation results. Purchasing a commercial credit-bureau file would introduce regulated PII, require lengthy licensing agreements, and restrict publication. Conducting a real-world micro-lending experiment might yield ground-truth defaults, but it is expensive, carries regulatory risk, and cannot be completed in a week. A consumer survey can be run rapidly, but it provides self-reported information that lacks verified default events and still leaves the extreme class imbalance unresolved.
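To make the oversampling idea concrete, here is a minimal sketch in Python. It assumes two numeric features and uses a per-class multivariate Gaussian as a simple stand-in for a real synthetic-data generator; the function name `synthesize_portfolio`, the feature meanings, and the toy "real" portfolio are all hypothetical and not part of the question. The toy portfolio has 40 defaults in 50,000 accounts (0.08%), and the synthetic sample is drawn at a chosen 20% default rate.

```python
import numpy as np

def synthesize_portfolio(X, y, n_samples, default_rate, seed=0):
    """Draw synthetic rows from a Gaussian fitted separately to each class.

    Illustrative assumption: features are numeric and roughly Gaussian
    within each class. A production generator would need a richer model.
    """
    rng = np.random.default_rng(seed)
    n_default = int(round(n_samples * default_rate))
    counts = {0: n_samples - n_default, 1: n_default}

    parts, labels = [], []
    for cls, n in counts.items():
        rows = X[y == cls]
        mean = rows.mean(axis=0)            # per-class feature means
        cov = np.cov(rows, rowvar=False)    # per-class feature covariance
        parts.append(rng.multivariate_normal(mean, cov, size=n))
        labels.append(np.full(n, cls))
    return np.vstack(parts), np.concatenate(labels)

# Toy stand-in for the real portfolio (hypothetical features:
# credit utilization and months on book).
rng = np.random.default_rng(42)
y_real = np.zeros(50_000, dtype=int)
y_real[:40] = 1                              # 40 / 50,000 = 0.08% defaults
X_real = rng.normal(loc=[0.3, 24.0], scale=[0.1, 6.0], size=(50_000, 2))
X_real[y_real == 1] += np.array([0.4, -6.0])  # shift the default class

# Oversample defaults to 20% of the synthetic sample.
X_syn, y_syn = synthesize_portfolio(X_real, y_real, n_samples=10_000,
                                    default_rate=0.20)
print(X_syn.shape, y_syn.mean())
```

The synthetic rows are sampled from fitted distributions rather than copied from customer records, which is the property the explanation relies on. Whether a given generator leaks information about individuals still has to be assessed for the specific method used.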