Troubleshooting, Monitoring & Best Practices (AB-900) Flashcards
Microsoft 365 Certified: Copilot and Agent Administration Fundamentals AB-900 Flashcards

| Front | Back |
| --- | --- |
| Action when an agent becomes unresponsive | Restart the agent, capture diagnostics, and collect the most recent logs for analysis |
| Best practice for blue-green deployments | Use traffic shifting to validate the new release, then roll back quickly on failure |
| Best practice for configuration management | Store configs in version control and use immutable deployments for reproducibility |
| Best practice for dependency updates | Pin dependency versions, run automated tests, and deploy to a canary before full rollout |
| Best practice for logging sensitive data | Mask or redact PII at source and avoid logging secrets |
| Common cause of high API costs | Excessive token usage, inefficient prompting, or lack of caching |
| Common cause when Copilot returns irrelevant answers | Insufficient context or a wrong system prompt; provide clearer context, update the system prompt, and retry |
| First step when latency spikes occur | Check resource utilization (CPU, memory, and network), then correlate with recent deployments |
| How to audit changes that caused a regression | Check commit history and CI pipeline artifacts, then roll back to a stable release |
| How to collect logs for an agent | Enable debug logging in the agent config; gather application logs, system logs, and transport logs |
| How to confirm data exfiltration risk | Check outbound connections, access logs, and unusual data transfer patterns |
| How to detect security incidents | Monitor for unusual authentication attempts, privilege escalations, and unexpected outbound traffic |
| How to diagnose memory leaks in agents | Monitor memory growth over time using heap dumps and profiler captures |
| How to handle corrupted model cache | Clear the cache, restart the service, and warm the cache with known-good requests |
| How to handle rate limit errors | Implement exponential backoff retries and request batching where possible |
| How to monitor latency percentiles | Track p50, p90, and p99, and prioritize fixes based on p99 impact |
| How to monitor model prompt usage | Log prompts and correlate with cost and performance while applying privacy controls |
| How to perform root cause analysis for errors | Reproduce the issue, capture logs and traces, then narrow down to a code or infrastructure change |
| How to prevent replay attacks | Implement nonces, timestamps, and short-lived tokens |
| How to profile CPU hot spots | Use sampling profilers to find functions with highest CPU time and optimize or refactor |
| How to reduce cold start latency | Keep instances warm, use lightweight initialization, and preload models where possible |
| How to reproduce intermittent failures | Record the input and environment state, then run stress tests with the same load profile |
| How to secure logs in transit and at rest | Use TLS for transport and encryption with access controls for stored logs |
| How to test disaster recovery plans | Run scheduled failover drills and validate data integrity and recovery time objectives |
| How to tune prompt length for performance | Minimize context to necessary tokens and cache static context where possible |
| How to validate agent permissions | Audit IAM roles and least-privilege assignments, and run permission checks |
| Indicator of throttling at network layer | Increase in connection resets, timeouts, or HTTP 429 responses from services |
| Key indicator of model degradation | Shift in user satisfaction scores or sudden drop in task completion rate |
| Primary metric to monitor agent health | Heartbeat or alive signal frequency and success rate |
| Recommended alerting strategy | Avoid alert fatigue by setting severity thresholds and routing alerts to on-call staff with runbooks |
| Recommended retention policy for logs | Keep high-fidelity logs short term for debugging and aggregated summaries longer term |
| Recommended sleep strategy for retry logic | Use exponential backoff with jitter to avoid thundering herd problems |
| Steps for secure incident response | Isolate affected systems, preserve evidence, rotate credentials, and perform forensic analysis |
| Tool to centralize logs across instances | Use a log aggregator like Elasticsearch, Splunk, or a hosted logging service |
| Typical resolution for authentication failures | Verify credentials and tokens, check clock skew, and refresh or rotate keys |
| What to do on discovery of leaked keys | Revoke keys, rotate secrets, and search logs for suspicious usage |
| What to include in a diagnostic bundle | Application logs, config files, traces, metrics, and recent deployment manifests |
| When to enable tracing | Enable distributed tracing for requests that span multiple services to identify bottlenecks |
| When to increase agent concurrency | When CPU and memory headroom exist and response latency remains acceptable |
| When to scale horizontally vs vertically | Scale horizontally for stateless services and vertically for a single process bound by CPU |
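The card on logging sensitive data recommends masking or redacting PII at source. A minimal sketch of that idea using Python's standard `logging` module follows. The two regular expressions and the `RedactingFilter` name are illustrative assumptions, not part of the deck or of any Microsoft tooling, and a real deployment would need a broader, vetted set of patterns.

```python
import logging
import re

# Hypothetical patterns for illustration only; they will not catch every case.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")


def redact(text: str) -> str:
    """Mask email addresses and key=value style secrets in a string."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so that values passed as arguments are redacted too.
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already formatted
        return True
```

Attaching the filter to a handler (`handler.addFilter(RedactingFilter())`) applies the masking to every record that handler emits, so individual call sites do not have to remember to redact.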
About the Flashcards
Flashcards for the Microsoft 365 Certified: Copilot and Agent Administration Fundamentals exam focus on the operational lifecycle of intelligent agents. Review how to configure debug logging, centralize log streams, and protect sensitive information while meeting retention policies. Learn the key health metrics, heartbeat signals, and latency percentiles that reveal performance trends, plus effective alerting approaches that avoid fatigue.
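As a rough illustration of the latency percentiles mentioned above, the sketch below computes p50, p90, and p99 from a list of request latencies using the nearest-rank method. The function names are hypothetical, and production monitoring systems typically use streaming or histogram-based estimates rather than sorting raw samples.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize latencies (in milliseconds) as p50, p90, and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}
```

With only a handful of samples, p90 and p99 both collapse to the slowest request, which is one reason tail percentiles are only meaningful over a reasonably large window.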
The deck also guides you through troubleshooting techniques for high latency, memory leaks, rate limits, and API cost spikes. It covers secure incident response, blue-green deployments, scaling decisions, and prompt optimization to keep models responsive and cost-effective. Use these cards to reinforce terminology and best-practice workflows that often appear in exam scenarios.
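For the rate-limit cards, one common way to implement exponential backoff with jitter is the "full jitter" variant sketched below. `RateLimitError`, `backoff_delays`, and `call_with_retries` are hypothetical names standing in for whatever client and error type a real service uses (for example, an HTTP 429 response), and the default base and cap values are arbitrary assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error standing in for an HTTP 429 response."""


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield full-jitter delays: rng() scaled by min(cap, base * 2**attempt)."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, retries=5, sleep=time.sleep, **backoff_kwargs):
    """Call func, retrying on RateLimitError with jittered exponential backoff."""
    for delay in backoff_delays(retries, **backoff_kwargs):
        try:
            return func()
        except RateLimitError:
            sleep(delay)  # wait a randomized, growing interval before retrying
    return func()  # final attempt; any error propagates to the caller
```

Randomizing each delay spreads retries from many clients over time, which is what avoids the thundering herd problem named in the deck. Passing `sleep` and `rng` as parameters is only a design choice to make the behavior easy to test.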
Topics covered in this flashcard deck:
- Agent monitoring metrics
- Logging & tracing
- Performance troubleshooting
- Security & incident response
- Scaling and deployments