CompTIA Data+ Practice Test (DA0-002)
Use the form below to configure your CompTIA Data+ Practice Test (DA0-002). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA Data+ DA0-002 (V2) Information
The CompTIA Data+ exam is a certification test for people who want to show they understand how to work with data. Passing it proves that you can collect, organize, and analyze information to help businesses make smart choices. It also checks whether you know how to create reports, use charts, and follow rules that keep data safe and accurate. CompTIA suggests having about 1 to 2 years of experience working with data, databases, or tools like Excel, SQL, or Power BI before taking the test.
The exam is organized into parts called domains. These cover basic data concepts and environments, acquiring and preparing data, analyzing it, and creating easy-to-read reports and visualizations. Another important domain is data governance, which covers keeping data secure, private, and high quality. Each domain carries its own percentage of the questions, with data analysis being the largest at 24%.
Overall, the CompTIA Data+ exam is a good way to prove your skills if you want a career in data. It shows employers that you know how to handle data from start to finish, including collecting it, checking it for errors, and sharing results in clear ways. If you enjoy working with numbers and information, this certification can be a great step forward in your career.

Free CompTIA Data+ DA0-002 (V2) Practice Test
- 20 Questions
- Unlimited time
- Data Concepts and Environments
- Data Acquisition and Preparation
- Data Analysis
- Visualization and Reporting
- Data Governance
A data analytics team requires shared access to a set of moderately sized datasets. The data is stored on a central server and accessed by team members through a mapped network drive on their workstations. This system organizes data in a familiar folder-and-file structure, allowing users to navigate through a directory tree to open files like .csv and .xlsx. Which storage type does this scenario describe?
Local storage
Block storage
File storage
Object storage
Answer Description
The correct answer is File storage. This scenario describes a classic file storage system, often implemented as Network Attached Storage (NAS). Key characteristics include its hierarchical organization (data stored in files within a tree of folders/directories) and its ability to provide shared access to multiple users over a network. The description of a 'mapped network drive' and a 'folder-and-file structure' directly aligns with the definition of file storage.
- Object storage is incorrect because it stores data in a flat structure of 'buckets' or 'containers', not a hierarchical file system. Access is typically handled via APIs rather than a browsable file system.
- Block storage is incorrect because it operates at a lower level, breaking data into fixed-size blocks. A file system is built on top of block storage, but block storage itself does not inherently present a folder/file structure to the end-user.
- Local storage is incorrect because it refers to storage directly attached to or inside a single user's computer (e.g., a local C: drive). The scenario specifies a 'central server' and shared access for the entire team, which points to a networked, not local, solution.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the difference between file storage and object storage?
Why is block storage not suitable for shared access like file storage?
What is Network Attached Storage (NAS) and how does it relate to file storage?
A data analyst is working with a movie dataset where one of the columns, 'genres', contains a comma-separated string of all genres applicable to a single movie (e.g., 'Action,Adventure,Sci-Fi'). The analyst's objective is to calculate the total number of movies for each individual genre. To accomplish this, each genre for a given movie must be represented on its own row.
Which of the following data transformation techniques should the analyst use to restructure the 'genres' column for this analysis?
Binning
Parsing
Exploding
Imputation
Answer Description
The correct answer is Exploding. The 'exploding' transformation is used to convert a single row that contains a list-like structure (such as a comma-separated string or an array) into multiple rows, one for each element in the list. In this scenario, it would create a new row for each genre associated with a movie, allowing for accurate aggregation by genre.
- Parsing involves analyzing a string of text to break it into smaller, meaningful components. While parsing the 'genres' string is a necessary step to separate the genres, the overall technique that creates new rows from these components is known as exploding.
- Binning is a technique used to group a range of continuous numerical values into discrete intervals or 'bins'. It is not suitable for handling list-like categorical data.
- Imputation is the process of replacing missing values in a dataset with substituted values. This is not relevant to the task, as the problem is about restructuring existing data, not filling in missing data.
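To make the exploding step concrete, here is a minimal T-SQL sketch; it assumes a hypothetical movies table with movie_id and genres columns and uses STRING_SPLIT, which is available in SQL Server 2016 and later (in pandas, the equivalent is Series.str.split followed by DataFrame.explode):
-- One output row per movie/genre pair; trim spaces left over from the comma-separated list.
SELECT m.movie_id,
       LTRIM(RTRIM(s.value)) AS genre
FROM movies AS m
CROSS APPLY STRING_SPLIT(m.genres, ',') AS s;

-- With the genres exploded, counting movies per genre is a simple aggregation.
SELECT LTRIM(RTRIM(s.value)) AS genre,
       COUNT(DISTINCT m.movie_id) AS movie_count
FROM movies AS m
CROSS APPLY STRING_SPLIT(m.genres, ',') AS s
GROUP BY LTRIM(RTRIM(s.value));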
Ask Bash
What does 'exploding' mean in data transformation?
How is 'parsing' different from 'exploding'?
When should you use 'binning' in data analysis?
At 10:00 a.m., the vice-president of marketing asks you to show the conversion-funnel results of yesterday's four-hour flash sale in a noon executive meeting. No existing KPI dashboards include that promotion, and the standard sales-operations dashboard will not refresh with yesterday's data until its normal overnight schedule. To meet the VP's deadline with the least delay, which dashboard frequency should you choose?
Create an ad hoc dashboard that queries yesterday's sale data and share an immediate link with the VP.
Build a continuously streaming real-time dashboard that refreshes every few seconds on the intranet.
Add the flash-sale metrics to the nightly recurring dashboard and provide the refreshed version tomorrow.
Export the weekly KPI dashboard to PDF and email it before the meeting.
Answer Description
Because the request is time-sensitive, one-off, and outside the scope of the recurring overnight dashboard, the analyst should build and share an ad hoc dashboard. Ad hoc dashboards (or reports) are created on demand to answer a specific question and are not tied to a fixed refresh schedule. Waiting for the nightly recurring dashboard, sending a weekly static report, or engineering a continuous real-time stream would all miss the immediate noon deadline or require unnecessary effort for a single meeting.
Ask Bash
What is an ad hoc dashboard?
How does an ad hoc dashboard differ from a recurring dashboard?
When would it be appropriate to use real-time dashboards instead of ad hoc dashboards?
A data analyst at an e-commerce company is tasked with creating a customer segment for a targeted direct mail campaign. The business rule for this campaign is that every selected customer record must have a complete mailing address (street, city, state, and postal code). The analyst runs a query to check the customer table and discovers that while the customer_id, city, and state columns are fully populated, 30% of the records have null values in the postal_code column. Which data quality issue is the primary concern for the analyst in this scenario?
Incompleteness
Redundancy
Outliers
Duplication
Answer Description
The correct answer is Incompleteness. Data completeness is a data quality dimension that measures whether all necessary data is present in a dataset. In this scenario, the postal_code is a required field for the direct mail campaign, and its absence in 30% of records makes the dataset incomplete for the intended business purpose.
- Duplication is incorrect because it refers to the same record appearing multiple times in a dataset. The problem described is missing data within records, not repeated records.
- Redundancy is incorrect as it refers to the same data being stored in multiple places, which can lead to inconsistencies but is not the issue here.
- Outliers are incorrect because they are data points that are present but deviate significantly from the other observations. The issue identified is the absence of data (null values), not the presence of abnormal data.
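A quick way to quantify this kind of incompleteness before building the segment is a null-count query. The sketch below is illustrative and assumes a customers table shaped like the one in the scenario:
-- Count how many records are unusable for the mailing because postal_code is missing.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN postal_code IS NULL THEN 1 ELSE 0 END) AS missing_postal_code,
       100.0 * SUM(CASE WHEN postal_code IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_missing
FROM customers;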
Ask Bash
What does 'data incompleteness' mean in a data quality context?
Why is a null value in a dataset problematic for data analysis?
How can missing `postal_code` values be resolved when preparing data?
A data analyst at a healthcare organization is preparing a dataset for a university research study on patient outcomes. The dataset contains sensitive Personal Health Information (PHI). To comply with privacy regulations and protect patient identities, while still providing valuable data for statistical analysis, which of the following data protection practices is the MOST appropriate to apply before sharing the dataset?
Data masking
Role-based access control (RBAC)
Anonymization
Encryption at rest
Answer Description
The correct answer is Anonymization. Anonymization is the process of removing or modifying personally identifiable information (PII) and personal health information (PHI) to ensure that the individuals who are the subjects of the data cannot be identified. This technique is essential when sharing sensitive data for external research or statistical analysis, as it protects privacy while maintaining the analytical value of the data.
- Data masking is a plausible but less appropriate choice. Masking typically replaces sensitive data with realistic but fabricated data and is most often used for internal purposes like software testing or training where the data format must be preserved. While it is a form of data obfuscation, anonymization is the more precise term for the goal of making re-identification impossible for research purposes.
- Encryption at rest protects data that is stored on a disk or in a database by making it unreadable without a decryption key. It does not protect the data once it has been decrypted and shared with the third-party researchers.
- Role-based access control (RBAC) is a method for managing who can access data within a system based on their job function. It controls access permissions but does not alter the data itself to make it safe for sharing with an external entity.
Ask Bash
What is the difference between anonymization and pseudonymization?
Why is encryption at rest insufficient for securely sharing PHI with researchers?
How does data masking differ from anonymization?
You are preparing a set of numeric customer-behavior features for a k-means clustering model. One of the variables, lifetime_value, is highly right-skewed and contains several extreme outliers that would dominate Euclidean distance calculations if left untreated. You want each feature to contribute proportionally to the distance metric without letting those few large values distort the scale. Which preprocessing technique should you apply before running the clustering algorithm?
Apply a logarithmic transformation followed by min-max scaling.
Apply Z-score standardization so each feature has mean 0 and standard deviation 1.
Apply min-max scaling to force every feature into a 0-1 range.
Apply a robust scaler that centers on the median and scales by the interquartile range.
Answer Description
Robust scaling uses the median as its center and the interquartile range (IQR) as its scale factor. Because both the median and IQR are resistant to extreme values, the transformation keeps most observations on a comparable scale while preventing a handful of very large lifetime_value records from stretching the range of the whole variable.
Min-max scaling rescales the data between fixed bounds such as 0 and 1, but the presence of even a single extreme value pushes all other observations into a narrow interval, so outliers still dominate. Standardization (Z-score) centers data on the mean and scales by the standard deviation; since both statistics are sensitive to outliers, the resulting values can still be overstretched by extreme cases. A log transform followed by min-max scaling can reduce skew but will still tie the upper bound of the scale to the largest remaining value, offering less protection than an IQR-based approach. Therefore, robust scaling is the most appropriate choice for this scenario.
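For reference, the same median/IQR scaling can be computed directly in SQL. The sketch below is a rough illustration that assumes a hypothetical customer_features table; in Python, scikit-learn's RobustScaler applies the identical logic.
-- Derive the median and interquartile range once, then rescale every row.
WITH stats AS (
    SELECT DISTINCT
        PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY lifetime_value) OVER () AS med,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY lifetime_value) OVER () AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY lifetime_value) OVER () AS q3
    FROM customer_features
)
SELECT c.customer_id,
       (c.lifetime_value - s.med) / NULLIF(s.q3 - s.q1, 0) AS lifetime_value_robust
FROM customer_features AS c
CROSS JOIN stats AS s;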
Ask Bash
Why is Robust Scaling particularly suited for handling outliers?
How does Euclidean distance affect clustering when using raw data with outliers?
How does Min-Max Scaling compare to Robust Scaling in handling outliers?
A U.S.-based retailer wants to replicate its PostgreSQL production database, which stores personal data about European Union customers, to a cloud analytics cluster located in Singapore. To satisfy the jurisdictional requirement portion of data-compliance planning, which action should the data team perform first?
Confirm that transferring EU personal data to Singapore is permitted and implement an approved cross-border transfer mechanism (for example, Standard Contractual Clauses).
Validate that the destination cluster enforces column-level encryption for all sensitive fields.
Update the data dictionary to reflect schema changes introduced in the analytics environment.
Ensure the replication job meets the required recovery-point and recovery-time objectives.
Answer Description
Jurisdictional requirements focus on laws that apply because of where data is collected, stored, or accessed. Under the GDPR, Singapore is a "third country," so exporting EU personal data there is lawful only if an approved cross-border transfer mechanism, such as an adequacy decision, Standard Contractual Clauses, or Binding Corporate Rules, has been validated. Verifying and implementing that legal basis directly addresses the geographic (jurisdictional) aspect of compliance. The other options deal with technical safeguards (encryption), disaster-recovery timing, or metadata maintenance; none of those, by themselves, satisfy location-based legal obligations.
Ask Bash
What are Standard Contractual Clauses (SCCs) under GDPR?
What is the legal significance of 'third countries' under GDPR?
What is the difference between jurisdictional requirements and technical safeguards for data compliance?
A data analyst is tasked with processing thousands of unstructured customer reviews from a company's website. The goal is to quickly identify common topics, summarize feedback for different product lines, and understand overall sentiment. Manually reading all the reviews is not feasible due to the volume of data. Which of the following AI concepts is the most appropriate for generating human-like text summaries from this unstructured data?
Large language model (LLM)
Robotic process automation (RPA)
A dimensional table
Foundational model
Answer Description
The correct answer is a large language model (LLM). LLMs are a type of artificial intelligence specifically designed to understand, process, and generate human-like text based on the vast amounts of data they are trained on. This makes them ideal for tasks like summarizing unstructured text, analyzing sentiment, and identifying themes from customer reviews.
- Robotic process automation (RPA) is incorrect because it is designed to automate repetitive, rule-based tasks by mimicking human interactions with digital systems, such as data entry or file transfers. It does not have the capability to interpret unstructured text and generate novel summaries.
- A dimensional table is a data warehouse structure that stores descriptive attributes to provide context to facts in a fact table. It is a way to organize structured data for analysis, not a tool for processing and summarizing unstructured text.
- While an LLM is a type of foundational model, 'foundational model' is a broader category that can include models for images, audio, and other data types, not just text. 'Large language model' is the more specific and therefore most appropriate choice for a task that exclusively involves processing and generating text.
Ask Bash
How does a Large Language Model (LLM) work?
Why is Robotic Process Automation (RPA) unsuitable for text summarization?
What is the difference between a foundational model and a large language model (LLM)?
A data analyst is preparing a 250,000-row customer data set to train a supervised churn-prediction model. The target column, Churn_Flag, contains Yes/No values for 248,700 customers, while the remaining 1,300 rows have NULL in that column only; every feature in those 1,300 rows is otherwise complete and within expected ranges. Exploratory checks show that dropping 1,300 records will not materially change the class balance or statistical power of the model. The machine-learning library being used will raise an error if the target variable is missing. Which data-cleansing technique is MOST appropriate for handling the 1,300 affected rows before modeling?
Delete the 1,300 rows that have a NULL value in Churn_Flag before training the model.
Apply min-max scaling to the numeric features so the algorithm can ignore the NULL labels.
Impute each missing Churn_Flag with the most common class so the overall distribution is preserved.
Bin Churn_Flag into broader categories and keep the rows to maximize training data size.
Answer Description
Because the missing values occur in the target variable, not in the predictor features, the rows cannot contribute to supervised learning. Imputing or transforming the missing target would inject fabricated labels and risk corrupting the model. Binning or scaling features does nothing to resolve the missing label, and the library will still fail. Given that the affected subset represents only 0.52% of the data and its removal does not bias the class distribution, listwise deletion (dropping those rows) is the proper cleansing step. Imputing the mode would create false churn outcomes; scaling features leaves the NULLs untouched; and binning the target is impossible without a value to bin, so those choices are incorrect.
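As a concrete illustration, the deletion itself is a single filter on the label column. The sketch below assumes the data sits in a hypothetical customer_churn table; the pandas equivalent is dropna(subset=['Churn_Flag']).
-- Keep only labelled rows for supervised training; the unlabelled rows are excluded.
SELECT *
INTO #churn_training
FROM customer_churn
WHERE Churn_Flag IS NOT NULL;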
Ask Bash
Why is it important to delete rows with NULL values in the target variable instead of imputing them?
What does class balance mean in the context of machine learning?
How does listwise deletion impact the statistical power of a machine learning model?
A data analyst must ensure that a consolidated revenue report is created, saved to a shared drive, and emailed to executives automatically at 03:00 every night, when no employees are logged on. When configuring the robotic process automation (RPA) workflow, which bot type or deployment model is the most appropriate for this fully automated reporting requirement?
Deploy a test-environment bot that executes the workflow only when a QA engineer approves a build.
Configure an unattended bot and schedule it in the RPA orchestrator to run at 03:00.
Use an attended bot that the analyst launches manually each morning after logging in.
Rely on a citizen-developer desktop recorder that operates only while the analyst is active.
Answer Description
Unattended bots are designed to run end-to-end automations without any human trigger. They can be scheduled in an orchestrator or control room to start at a specific time, log in to systems, execute every step of the workflow, and distribute the finished report. Attended bots require a user to start them and therefore cannot guarantee execution at 03:00 when staff are offline. Citizen-developer desktop recordings are intended for design or personal productivity, not for lights-out production runs. Test or QA bots are used for validation in non-production environments and are not scheduled in production for routine reports. Because the scenario calls for lights-out, time-based execution, the unattended, scheduler-driven bot is the only suitable choice.
Ask Bash
What is an unattended bot?
What is the role of an RPA orchestrator?
How does an unattended bot differ from an attended bot?
A multinational retailer replicates its EU customer database from Frankfurt to several cloud regions worldwide for disaster-recovery analytics. During a GDPR compliance audit, the assessor finds that (1) the data is copied daily to U.S. and APAC regions without any approved transfer mechanism and (2) snapshots of the replicated database have been kept for three years because no retention policy exists. Which set of compliance measures would BEST remediate both findings while still allowing the business to keep a global backup?
Require unit and user-acceptance testing for each region and tag every snapshot with metadata to identify its business owner.
Tokenise cardholder data, reclassify the database under PCI DSS, and increase snapshot frequency so that no records are lost.
Adopt Standard Contractual Clauses (or another Article 46 safeguard) for the international transfers, restrict replication to approved regions, and create a documented retention schedule that deletes or anonymises snapshots once the business purpose expires.
Encrypt the database in transit and at rest, mask sensitive columns, and enable automated data-quality profiling to detect drift.
Answer Description
Under GDPR, personal data may only be transferred outside the European Economic Area if the controller applies an approved safeguard such as Standard Contractual Clauses (SCCs) or Binding Corporate Rules. This directly addresses the cross-border replication that the auditor flagged. GDPR's storage-limitation principle also requires organisations to delete or anonymise personal data when it is no longer needed, so a documented retention schedule for snapshots is necessary. Restricting or geo-fencing replication to approved regions ensures ongoing jurisdictional compliance while still permitting a disaster-recovery copy. The distractors focus on security controls (encryption, masking), PCI-specific requirements, or testing activities; none of those controls by themselves fix both the unlawful transfer and the indefinite retention issues that triggered the compliance finding.
Ask Bash
What are Standard Contractual Clauses (SCCs) under GDPR?
What is the GDPR storage limitation principle?
How does geo-fencing aid in GDPR compliance for data replication?
A marketing analyst receives a daily orders file named orders_2025-08-28.json from an internal API. Each JSON record represents a single order and contains a line_items array; every element of line_items is itself an object that holds product_id, quantity and unit_price. The analyst must ingest the data into a relational reporting table that has one row per order. Based on the characteristics of the .json format, which technical challenge is the analyst MOST likely to face during the load?
Decompressing the mandatory GZIP compression that JSON applies to all text files.
Translating 64-bit integers from big-endian to little-endian format before they can be stored in the database.
Flattening hierarchical objects and arrays that do not map cleanly to a two-dimensional row-and-column structure.
Converting extended characters because JSON lacks native support for Unicode (UTF-8) encoding.
Answer Description
JSON is a text-based format that can nest objects and arrays inside other objects. When a JSON document includes an array-such as line_items-the structure becomes hierarchical. Relational tables, however, are two-dimensional and expect one scalar value per column, so the nested content must be "flattened" (for example, exploded into separate rows or moved to a child table) before it can be inserted.
The other options describe problems that do not inherently apply to .json files:
- JSON files are plain text; GZIP or any other compression is optional, not mandatory.
- Numbers in JSON are encoded as character strings, not as binary integers, so byte-order (endianness) translation is unnecessary.
- The JSON standard requires UTF-8 encoding for open data exchange, so full Unicode character sets are already supported.
Therefore, the most likely obstacle is flattening the nested arrays and objects into a flat relational structure.
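In SQL Server, one common way to perform that flattening is OPENJSON. The sketch below is illustrative: it assumes the daily file has been staged into a hypothetical raw_orders table with one JSON document per row, and that each order object carries an order_id field. The result has one row per line item, which can feed a child table or be aggregated back to one row per order.
-- The first OPENJSON projects order-level fields; the second unpacks the nested line_items array.
SELECT o.order_id,
       li.product_id,
       li.quantity,
       li.unit_price
FROM raw_orders AS r
CROSS APPLY OPENJSON(r.order_json)
     WITH (order_id   VARCHAR(20)   '$.order_id',
           line_items NVARCHAR(MAX) '$.line_items' AS JSON) AS o
CROSS APPLY OPENJSON(o.line_items)
     WITH (product_id VARCHAR(20)   '$.product_id',
           quantity   INT           '$.quantity',
           unit_price DECIMAL(10,2) '$.unit_price') AS li;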
Ask Bash
What does it mean to flatten hierarchical JSON data?
How can nested JSON arrays be handled when loading data into a relational database?
Why don't JSON files require compression or endianness conversions?
Block storage is often selected to host high-performance transactional databases in both cloud and on-premises environments. Which of the following characteristics best explains why block storage fits this workload?
It stores each dataset as immutable objects along with extensive custom metadata and automatically replicates or erasure-codes those objects across geographic regions to maximize durability.
It presents volumes to the operating system as raw disks whose data is divided into fixed-size blocks that can be addressed over protocols like iSCSI or NVMe/TCP, delivering consistently low I/O latency.
It exposes data through a hierarchical directory path over shared-file protocols such as NFS or SMB so multiple users can concurrently edit common files.
It stripes data across nodes using erasure coding to reduce capacity consumption, accepting higher write latency suited mainly for cold-archive workloads.
Answer Description
Block storage divides data into fixed-size chunks (blocks) that are addressed directly by the operating system using low-level protocols such as iSCSI or NVMe/TCP. Because the host treats the volume like a locally attached disk, the database can format the device with its own file system and issue random reads and writes with sub-millisecond latency, which is ideal for OLTP workloads.
The other statements describe capabilities of other storage types:
- Object storage keeps data as immutable objects with rich metadata and often uses replication or erasure coding across regions to maximize durability; this design trades latency for scalability and resilience.
- File storage exposes a hierarchical directory structure over protocols like NFS or SMB for shared user access, which adds additional overhead compared with raw block devices.
- Erasure-coded, capacity-optimized architectures are tuned for archival or infrequently accessed data, not latency-sensitive transaction processing.
Ask Bash
What is iSCSI and how does it relate to block storage?
What is the difference between block storage and object storage?
Why is low I/O latency important for transactional databases?
A data analyst is developing a star schema to analyze sales performance. The goal is to aggregate key business metrics, such as quantity_sold and sale_amount, for every transaction. This central table must also connect to dimensional tables for Date, Product, and Store. Which type of table should the analyst use for this purpose?
Staging table
Dimensional table
Fact table
Bridge table
Answer Description
The correct answer is a fact table. In a dimensional model like a star schema, the fact table is the central table that stores quantitative measurements or metrics about a business process. In this scenario, the quantity_sold and sale_amount are the facts. The fact table also contains foreign keys that connect to the primary keys of the surrounding dimensional tables (Date, Product, and Store), which provide context to the facts.
- A dimensional table stores the descriptive attributes related to the facts, such as product names, store addresses, or customer details, not the quantitative transactional data itself.
- A bridge table is a specific type of table used to resolve many-to-many relationships between a fact and a dimension or between two dimensions, which is not the primary purpose described in the scenario.
- A staging table is a temporary storage area used during the ETL (Extract, Transform, Load) process to clean and prepare data before it is loaded into the final destination, like a data warehouse. It is not a permanent part of the analytical schema.
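A minimal DDL sketch of such a fact table might look like the following; the table and key names are illustrative, not prescribed by the scenario:
CREATE TABLE FactSales (
    DateKey       INT           NOT NULL,  -- foreign key to the Date dimension
    ProductKey    INT           NOT NULL,  -- foreign key to the Product dimension
    StoreKey      INT           NOT NULL,  -- foreign key to the Store dimension
    quantity_sold INT           NOT NULL,  -- additive measure
    sale_amount   DECIMAL(12,2) NOT NULL   -- additive measure
);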
Ask Bash
What is the primary role of a fact table in a star schema?
Why do fact tables use foreign keys to connect to dimension tables?
How is a fact table different from a dimensional table?
A retail company has collected a vast dataset of millions of user-submitted images containing its products in various real-world settings. The goal is to develop a system that can automatically identify the specific product and its condition from each image. Which of the following AI concepts is best suited to handle this type of complex pattern recognition in large, unstructured datasets like images?
Automated reporting
Robotic Process Automation (RPA)
Deep learning
Natural Language Processing (NLP)
Answer Description
The correct answer is deep learning. Deep learning is a subset of machine learning that uses multi-layered neural networks to automatically learn and identify complex patterns from large amounts of raw, unstructured data, such as images. It is the primary technology behind modern image recognition and computer vision systems.
- Robotic Process Automation (RPA) is incorrect because it is designed to automate structured, rule-based tasks by mimicking human interactions with software interfaces, not to analyze or interpret unstructured data like images.
- Natural Language Processing (NLP) is incorrect as it focuses on the interaction between computers and human language (text and speech), not visual data.
- Automated reporting is incorrect because it is the process of generating reports, usually from structured data sources, and does not involve the complex analysis of image content.
Ask Bash
What makes deep learning suitable for image recognition tasks?
How is deep learning different from traditional machine learning?
What are some specific examples of deep learning applications in real-world scenarios?
A regional retail chain tracks point-of-sale data that is loaded into its data warehouse every night by 04:00. The sales director wants store managers to open an existing Power BI dashboard at 08:00 each Monday and immediately see a summary of the previous week's results without having to click a refresh button or run a query. Which delivery approach best meets this requirement while minimizing manual effort?
Switch the dataset to DirectQuery so the dashboard streams live transactions whenever someone opens it.
Provide an ad-hoc report template that managers must run and filter themselves each Monday morning.
Export the dashboard as a static PDF every Friday afternoon and email it to all store managers.
Configure a scheduled refresh that runs at 05:00 every Monday so the dashboard is updated before managers log in.
Answer Description
Because the managers need the same metrics on a predictable cadence (every Monday) and the source data is already finalized overnight, the most efficient solution is to set up an automated, recurring dashboard refresh that runs early Monday morning. A scheduled refresh guarantees the dashboard opens with fresh data and eliminates the need for users to trigger an update. Real-time DirectQuery is unnecessary for once-per-week consumption and can add gateway overhead; an ad-hoc template still requires each manager to generate the report manually; exporting a static PDF on Friday would leave three days of weekend sales missing by Monday.
Ask Bash
What is a scheduled refresh in Power BI?
Why is DirectQuery not suitable for this scenario?
What are the limitations of static PDF exports for dashboards?
During a performance review you discover that a reporting query contains this pattern:
SELECT ...
FROM (
    SELECT CustomerID, SUM(TotalDue) AS TotalSpent
    FROM dbo.Orders
    WHERE OrderDate >= '2024-01-01'
    GROUP BY CustomerID
) AS recent_orders
JOIN dbo.Orders o1 ON o1.CustomerID = recent_orders.CustomerID
JOIN dbo.Orders o2 ON o2.CustomerID = recent_orders.CustomerID;
The execution plan shows that the derived subquery against the 50-million-row Orders table is executed three times, causing very high logical reads. Without changing the final results, which action is most likely to reduce execution time and I/O?
Add an OPTION (FORCESEEK) hint to every Orders reference to force index seeks during each scan.
Insert the subquery results into a local temporary table (#recent_orders), add an index on CustomerID, and join the main query to that temporary table.
Rewrite the derived subquery as a common table expression (CTE) so SQL Server can cache the result internally.
Add WITH (NOLOCK) hints to all Orders references to avoid locking during the scans.
Answer Description
Persisting the subquery result in a session-scoped temporary table means the expensive aggregation runs only once. The main query can then join to the smaller, indexed #recent_orders table, eliminating two extra scans of the 50-million-row Orders table and sharply reducing logical reads. A CTE is usually inlined, so the optimizer may still re-evaluate it; index or lock hints leave the multiple scans in place and merely influence access or concurrency methods, so they do not address the root cause.
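Applied to the query above, the rewrite might look like the following sketch (the select list is left elided exactly as in the original):
-- Run the expensive aggregation once and persist the result.
SELECT CustomerID, SUM(TotalDue) AS TotalSpent
INTO #recent_orders
FROM dbo.Orders
WHERE OrderDate >= '2024-01-01'
GROUP BY CustomerID;

-- Index the temp table on the join column used by the main query.
CREATE CLUSTERED INDEX IX_recent_orders ON #recent_orders (CustomerID);

-- Join to the small, indexed temp table instead of re-aggregating Orders.
SELECT ...
FROM #recent_orders AS recent_orders
JOIN dbo.Orders o1 ON o1.CustomerID = recent_orders.CustomerID
JOIN dbo.Orders o2 ON o2.CustomerID = recent_orders.CustomerID;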
Ask Bash
What is a session-scoped temporary table in SQL?
Why is indexing a temporary table like #recent_orders important?
Why is a common table expression (CTE) not ideal in this scenario?
Your organization is designing a star schema for its e-commerce data warehouse. The model includes a very large Sales fact table that must join to a Product dimension table containing thousands of descriptive attributes (brand, category, size, color, etc.). To follow dimensional-modeling best practices and minimize storage and join costs in the fact table, which primary-key strategy is most appropriate for the Product dimension table?
A composite key of ProductSKU combined with EffectiveStartDate and EffectiveEndDate
An auto-incrementing integer surrogate key generated within the data warehouse
A globally unique identifier (GUID) assigned by the e-commerce application
A concatenated natural key made of SupplierID and ManufacturerPartNumber
Answer Description
Dimensional modeling guidelines recommend assigning each dimension row a meaningless, sequential surrogate key, typically a 4-byte integer, rather than relying on natural business keys or wide composite keys. A compact surrogate key keeps the fact table's foreign-key columns small (saving space and index overhead), speeds join performance, shields the warehouse from changes to source-system codes, and is mandatory when tracking slowly changing dimension history. Natural keys, GUIDs, and composite keys all consume more space, may change unexpectedly, and complicate surrogate-key lookups for historical versioning, so they are not preferred choices for a dimension table in a star schema.
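For illustration only, a Product dimension keyed this way might be declared as follows; the attribute list is abbreviated and the names are not prescribed by the scenario:
CREATE TABLE DimProduct (
    ProductKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- warehouse-generated surrogate key
    ProductSKU VARCHAR(40) NOT NULL,                    -- natural/business key kept as an attribute
    Brand      VARCHAR(60) NULL,
    Category   VARCHAR(60) NULL,
    Size       VARCHAR(20) NULL,
    Color      VARCHAR(30) NULL
);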
Ask Bash
What is a surrogate key in dimensional modeling?
Why is a surrogate key better than a GUID in a star schema?
What are slowly changing dimensions, and how do surrogate keys help manage them?
A data analyst is helping the product team create a survey to measure customer satisfaction with a new feature in their mobile application. The primary goal is to collect quantitative data that can be used to calculate an average satisfaction score and monitor trends over time. Which of the following question types would be the MOST appropriate to include in the survey to meet this specific requirement?
A 5-point Likert scale question, such as 'How satisfied are you with the new feature?' with options from 'Very Dissatisfied' to 'Very Satisfied'.
A dichotomous question, such as 'Did you find the new feature useful?' with 'Yes' or 'No' answers.
A multiple-choice question, such as 'Which of the following best describes your experience with the new feature?' with options like 'Easy to use', 'Buggy', and 'Helpful'.
An open-ended question, such as 'What are your thoughts on the new feature?'.
Answer Description
The correct answer is the 5-point Likert scale question. This type of question asks respondents to rate their level of agreement or satisfaction on a scale, which generates ordinal data. This data is commonly treated as interval data in business analytics, allowing for the calculation of an average (mean) score. This directly meets the project's goal of calculating an average satisfaction score to track over time.
The multiple-choice option provides nominal (categorical) data. It is not possible to calculate a meaningful average from non-ordered categories like 'Easy to use' or 'Buggy'. An open-ended question generates qualitative (text) data, which is valuable for deep insights but cannot be used to calculate a numerical average without significant extra processing like sentiment analysis or manual coding. A dichotomous question provides binary ('Yes'/'No') data. While you can calculate the percentage of 'Yes' responses, it does not provide a granular score that can be averaged to show varying levels of satisfaction.
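For example, once the responses are coded 1 (Very Dissatisfied) through 5 (Very Satisfied), the average satisfaction score is a single aggregate. The sketch below assumes a hypothetical survey_responses table with a numeric score column:
SELECT AVG(CAST(score AS DECIMAL(4,2))) AS avg_satisfaction,
       COUNT(*) AS responses
FROM survey_responses
WHERE question_id = 'new_feature_satisfaction';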
Ask Bash
What is a 5-point Likert scale question, and why is it suitable for measuring satisfaction?
Why can't multiple-choice or open-ended questions be used to calculate an average satisfaction score?
Why is a dichotomous question, like 'Yes' or 'No,' not the best option in this scenario?
A marketing manager at a subscription-based streaming service asks a data analyst to identify which current customers are most likely to cancel their subscriptions within the next 60 days. The analyst has access to a dataset containing each customer's viewing history, subscription tenure, and past support interactions, along with information on which similar customers have canceled in the past. The manager's goal is to proactively target these at-risk customers with a special retention offer. Which statistical method is MOST appropriate for fulfilling the manager's primary request?
Inferential
Predictive
Prescriptive
Descriptive
Answer Description
The correct answer is Predictive analysis. The scenario requires forecasting a future event, specifically, which customers are likely to cancel their subscriptions. Predictive analytics uses historical data to find patterns and make predictions about future outcomes.
- Descriptive analytics is incorrect because it focuses on summarizing past data to understand what has already happened (e.g., calculating the average subscription length of customers who have already churned). It does not forecast future events.
- Prescriptive analytics is incorrect because it recommends actions to take. While the ultimate goal is to act (the retention offer), the analyst's primary and immediate task is to identify the at-risk customers (prediction), which is a necessary step before a specific action can be prescribed.
- Inferential analytics is incorrect because it involves using a data sample to make generalizations or test hypotheses about a larger population (e.g., testing if there is a statistically significant difference in churn rates between two customer segments). The request is to create a model to score individual customers, not to test a general population hypothesis.
Ask Bash
How does Predictive Analytics determine which customers are likely to cancel their subscriptions?
How is Predictive Analytics different from Descriptive and Prescriptive Analytics?
What is the role of similar customer data in Predictive Analytics?
Woo!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.