CompTIA Data+ Practice Test (DA0-002)
Use the form below to configure your CompTIA Data+ Practice Test (DA0-002). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA Data+ DA0-002 (V2) Information
The CompTIA Data+ exam is a test for people who want to show they understand how to work with data. Passing this exam proves that someone can collect, organize, and study information to help businesses make smart choices. It also checks if you know how to create reports, use charts, and follow rules to keep data safe and accurate. CompTIA suggests having about 1 to 2 years of experience working with data, databases, or tools like Excel, SQL, or Power BI before taking the test.
The exam has different parts, called domains. These include learning basic data concepts, preparing data, analyzing it, and creating easy-to-read reports and visualizations. Another important part is data governance, which covers keeping data secure, private, and high quality. Each section of the test has its own percentage of questions, with data analysis being the largest part at 24%.
Overall, the CompTIA Data+ exam is a good way to prove your skills if you want a career in data. It shows employers that you know how to handle data from start to finish, including collecting it, checking it for errors, and sharing results in clear ways. If you enjoy working with numbers and information, this certification can be a great step forward in your career.

Free CompTIA Data+ DA0-002 (V2) Practice Test
- 20 Questions
- Unlimited
- Data Concepts and Environments, Data Acquisition and Preparation, Data Analysis, Visualization and Reporting, Data Governance
During a star-schema redesign, you are asked to modify the Customer dimension so that analysts can compare sales using the attributes that were valid at the time of each purchase (for example, the customer's region on the sale date). The data team is comfortable generating surrogate keys and adding effective and end-date columns, and they do not want to overwrite or delete historical rows. Which slowly changing dimension technique best satisfies these requirements?
Type 4 - maintain a separate history table while the main table holds the current row
Type 2 - insert a new row with a new surrogate key and effective-date range for every change
Type 3 - add additional columns to store the previous value, keeping only limited history
Type 1 - overwrite the existing row so only the current state is kept
Answer Description
The requirement is to keep every historical version of a dimension member, tie each fact row to the correct version, and identify the current record with a simple flag or NULL end date. SCD Type 2 meets all of these needs by inserting a new row (with a new surrogate key) whenever an attribute changes while retaining the prior rows. Type 1 overwrites values and loses history. Type 3 stores only a limited number of previous values in extra columns. Type 4 moves history to a separate table. None of these alternatives guarantees full, row-level history in the base dimension tied to facts via distinct surrogate keys.
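To see the mechanics, here is a minimal pandas sketch of a Type 2 change; the column names (customer_sk, is_current, and so on) are illustrative, and in practice this logic usually lives in the warehouse's ETL/ELT layer rather than in Python.
from datetime import date
import pandas as pd

# Current state of a Type 2 Customer dimension: one row per version of a customer.
customer_dim = pd.DataFrame([
    {"customer_sk": 1, "customer_id": "C100", "region": "East",
     "effective_date": date(2023, 1, 1), "end_date": None, "is_current": True},
])

def apply_type2_change(dim, customer_id, new_region, change_date):
    """Expire the current row, then insert a new version with a new surrogate key."""
    dim = dim.copy()
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[current, "end_date"] = change_date
    dim.loc[current, "is_current"] = False
    new_row = {
        "customer_sk": int(dim["customer_sk"].max()) + 1,  # new surrogate key
        "customer_id": customer_id,
        "region": new_region,
        "effective_date": change_date,
        "end_date": None,            # NULL end date marks the current version
        "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

customer_dim = apply_type2_change(customer_dim, "C100", "West", date(2024, 6, 1))
print(customer_dim)  # two rows for C100: the expired East version and the current West version
Fact rows loaded on or after the change date would reference the new surrogate key, while older facts keep the original key, which is what preserves point-in-time reporting.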
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI-generated content may display inaccurate information; always double-check anything important.
Why is SCD Type 2 considered the best choice for retaining historical data in a star-schema redesign?
What are surrogate keys, and why are they important in SCD Type 2 implementations?
What is the purpose of effective-date and end-date columns in SCD Type 2?
A healthcare provider loads patient encounter records into a Snowflake data warehouse every hour. The data team already has automated monitors that alert on row-count volume, table freshness, and schema changes. They now want to be warned if the relative frequency of ICD-10 diagnosis codes drifts gradually over several days, something that could skew epidemiological trend reports but would not break the pipeline outright. Which additional automated data-quality monitor would most directly address this requirement?
A regex validation that ensures each diagnosis_code matches the pattern [A-Z][0-9][0-9].[0-9]
A not-null percentage rule that triggers when the diagnosis_code column contains more than 1% NULL values
A statistical distribution monitor that compares current diagnosis-code frequencies with historical baselines and alerts on significant deviations
A referential-integrity check that confirms every encounter links to an existing patient_id in the patient table
Answer Description
A monitor that compares the current statistical distribution of diagnosis-code values with a historical baseline is designed to surface gradual drift. By tracking metrics such as frequencies, mean, variance, or standard deviation over time, the monitor can issue alerts when the distribution shifts beyond an acceptable confidence band. Static checks like NOT-NULL rules, regex format validation, or referential-integrity tests each protect other quality dimensions (completeness, syntactic validity, relational consistency) but will not detect a legitimate yet drifting pattern of values. Therefore, the distribution-based anomaly-detection monitor best satisfies the requirement.
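As an illustration of what such a monitor computes, the Python sketch below compares current diagnosis-code frequencies with a historical baseline using total variation distance; the codes and threshold are made up, and production tools wrap a similar comparison in scheduled checks and alerting.
import pandas as pd

def code_distribution(codes: pd.Series) -> pd.Series:
    """Relative frequency of each diagnosis code."""
    return codes.value_counts(normalize=True)

def distribution_drift(current: pd.Series, baseline: pd.Series) -> float:
    """Total variation distance: 0 means identical distributions, 1 means disjoint."""
    freq = pd.concat(
        [code_distribution(current), code_distribution(baseline)],
        axis=1, keys=["current", "baseline"],
    ).fillna(0.0)
    return 0.5 * (freq["current"] - freq["baseline"]).abs().sum()

DRIFT_THRESHOLD = 0.10  # illustrative value; tuned from normal week-to-week variation

current_codes = pd.Series(["E11.9", "I10", "I10", "I10", "J45.909"])
baseline_codes = pd.Series(["E11.9", "E11.9", "I10", "I10", "J45.909"])
if distribution_drift(current_codes, baseline_codes) > DRIFT_THRESHOLD:
    print("ALERT: diagnosis-code distribution has drifted from the historical baseline")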
Ask Bash
What is statistical distribution monitoring?
What are ICD-10 diagnosis codes?
How does Snowflake support data-quality monitoring?
Your organization is building a logistics dashboard that must display parcel-tracking events from a major shipping carrier within five minutes of each scan. The carrier can share its data in one of four ways: dropping a daily CSV file on an SFTP server, exposing an authenticated REST endpoint that returns JSON, presenting parcel status on public HTML pages, or emailing a weekly XLSX report. Which data-collection approach is the most appropriate data source to meet the dashboard's requirement while minimizing additional parsing effort?
Call the carrier's authenticated REST API to retrieve tracking events in JSON format.
Use a web-scraping script to extract tracking details from the carrier's HTML pages.
Download the daily CSV file from the SFTP drop and load it into the dashboard.
Import the weekly XLSX spreadsheet that the carrier emails to the operations team.
Answer Description
A REST API is specifically designed for system-to-system data exchange. Because the carrier's endpoint returns machine-readable JSON, the analytics team can ingest the data with little transformation and schedule calls as frequently as needed, achieving near-real-time updates. The SFTP CSV and weekly XLSX options introduce batch latency that violates the five-minute requirement, while the HTML option relies on web scraping, adding fragility and extra parsing work. Therefore, the authenticated REST API is the best fit.
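A minimal Python sketch of that ingestion pattern is shown below; the endpoint URL, query parameter, and bearer-token scheme are assumptions for illustration, since each carrier documents its own API.
import requests

API_URL = "https://api.example-carrier.com/v1/tracking/events"  # hypothetical endpoint
API_TOKEN = "REPLACE_WITH_REAL_TOKEN"                            # hypothetical credential

def fetch_recent_events(since_iso):
    """Pull tracking events newer than the given timestamp as parsed JSON."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"since": since_iso},   # assumed filter parameter
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # already machine-readable; no scraping or file parsing needed

# Scheduled every few minutes, this keeps the dashboard within the five-minute window.
events = fetch_recent_events("2025-01-01T00:00:00Z")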
Ask Bash
What is a REST API and why is it suitable for real-time data exchange?
What is JSON, and why is it preferred for data exchange compared to CSV or HTML formats?
What challenges are associated with web scraping HTML data for dashboards?
A data-analytics team is building a new pipeline that will run on a 100-node Apache Spark cluster. The team wants to (1) write code in the same language that Spark itself is implemented in, (2) gain immediate access to Spark's native DataFrame and Dataset APIs when a new Spark version is released, and (3) avoid the extra Py4J (or similar) serialization layer that adds cross-language overhead. According to the CompTIA Data+ list of common programming languages, which language should the team choose?
Java
Python
Scala
R
Answer Description
Scala is the language in which Apache Spark is primarily written, so developing Spark jobs in Scala avoids the Python-to-JVM bridge that Py4J introduces, removes an entire layer of serialization overhead, and lets developers use the DataFrame and strongly typed Dataset APIs as soon as they are available. Python, R, and Java can all be used with Spark, but Python and R rely on a cross-language bridge (Py4J for PySpark, a socket-based JVM backend for SparkR) and their APIs can lag slightly behind Scala's. Java is close, yet Spark's higher-level APIs and examples are maintained first in Scala, making Scala the best fit for the stated requirements.
Ask Bash
Why is Scala the preferred language for Apache Spark development?
What is the purpose of Py4J in Apache Spark?
What are the differences between DataFrame and Dataset APIs in Spark?
Your team has finished analyzing customer-satisfaction metrics for fiscal Q2. According to a mandate from corporate communications, the results summary must be a static document (such as a PDF, DOC, or image) for posting on the public investor-relations site. The content must also comply with Section 508 and WCAG-AA standards to ensure board members who rely on screen-reader software can independently review it. Finally, stakeholders will download the report on both mobile and desktop devices. Given these requirements, which communication approach provides the most accessible experience?
Export the visual summary to a tagged PDF that includes alt text for every chart, a logical reading order, and a high-contrast color theme.
Record a narrated screencast walk-through of the dashboard and embed the video on the site without closed captions or transcripts.
Publish an interactive HTML5 dashboard in which insights appear only when users hover over elements or interpret color cues.
Email the underlying Excel workbook that uses red/green conditional formatting to indicate trends but contains no supporting narrative.
Answer Description
Exporting the visuals to a properly tagged PDF that uses high-contrast colors and includes concise alt text meets all stated constraints. A tagged PDF carries a logical reading order and metadata that screen readers can parse, while alt text conveys the meaning of charts for users who cannot see them. High-contrast colors improve legibility for low-vision readers and satisfy WCAG contrast thresholds.
The interactive HTML5 dashboard is disallowed because the site only accepts static files, and hover-only interactions are not keyboard or screen-reader friendly. The color-coded Excel workbook fails WCAG 1.4.1 because color is the sole cue and offers no alt text or meaningful structure. The narrated screencast, lacking captions or a text alternative, excludes deaf or hard-of-hearing users and cannot be parsed by screen readers, so it does not satisfy the accessibility requirement.
Ask Bash
What is Section 508 and why is it important?
What does WCAG-AA compliance mean?
What are tagged PDFs and why are they essential for accessibility?
A data analyst is working on a new social media analytics platform. The platform must store vast amounts of user-generated content, including posts and user profiles that have varying attributes. The system must also efficiently map and query complex user relationships, such as friendships and content interactions, to identify key influencers. Given these requirements for a flexible schema and relationship-focused queries, which type of non-relational database would be the MOST suitable choice?
Column-family database
Key-value store
Graph database
Document database
Answer Description
The correct answer is a graph database. Graph databases are purpose-built to store and navigate relationships. Data is modeled as nodes (e.g., users, posts) and edges (e.g., 'friend', 'liked'), making them extremely efficient for traversing complex and deep relationships, which is a core requirement for identifying influencers in a social network.
A document database is incorrect because while it handles flexible schemas and user profiles well, it is less efficient than a graph database for analyzing complex, interconnected relationships between different data entities. A key-value store is also incorrect. It is designed for simple, fast lookups based on a unique key but lacks the capability to perform the complex relationship queries needed for this scenario. A column-family database is incorrect as it is optimized for analytical queries on large datasets and high-speed writes, but not specifically for traversing the types of relationships found in a social network.
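The node-and-edge model is easy to picture with a small in-memory example. The sketch below uses Python's networkx library purely to illustrate the kind of relationship query (here, a centrality measure to surface an influencer) that a graph database such as Neo4j is built to run efficiently at scale.
import networkx as nx

# Users and a post as nodes; follows/likes/authored relationships as directed edges.
G = nx.DiGraph()
G.add_edge("alice", "bob", kind="follows")
G.add_edge("carol", "bob", kind="follows")
G.add_edge("dave",  "bob", kind="follows")
G.add_edge("bob",   "post_42", kind="authored")
G.add_edge("alice", "post_42", kind="liked")

# Relationship-centric questions become a single traversal or centrality call.
influence = nx.in_degree_centrality(G)
top_user = max((n for n in G.nodes if not n.startswith("post_")), key=influence.get)
print(top_user)  # 'bob' -- the most-followed node in this toy network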
Ask Bash
What is a graph database, and how does it store data?
Why is a document database less efficient than a graph database for relationship-focused queries?
What are some real-world examples where graph databases excel?
A marketing department wants to optimize the allocation of its upcoming quarterly budget to maximize new customer acquisitions. A data analyst is tasked with developing a model that uses historical campaign performance data to recommend the most effective distribution of funds across various channels, such as social media, email, and pay-per-click advertising. Which type of statistical method is being employed to generate this recommendation?
Prescriptive analytics
Inferential analytics
Predictive analytics
Descriptive analytics
Answer Description
The correct answer is prescriptive analytics. This method is used to determine the best course of action to achieve a specific goal. In this scenario, the model is not just describing past results or predicting future outcomes; it is actively recommending an optimal budget allocation to maximize acquisitions.
- Predictive analytics is incorrect because it focuses on forecasting what is likely to happen in the future but does not recommend a specific action. For example, a predictive model could forecast the number of acquisitions for a given budget but would not determine the optimal budget distribution itself.
- Descriptive analytics is incorrect because it focuses on summarizing past data to understand what has happened. For instance, it would be used to report on the return on investment of past campaigns but not to recommend future strategies.
- Inferential analytics is incorrect because it involves using a data sample to make generalizations about a larger population. While it might be a component of the overall analysis, the primary goal here is to optimize and recommend a specific set of actions, which is the core function of prescriptive analytics.
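To make the distinction concrete, the sketch below uses SciPy's linear-programming solver to recommend a budget split. The acquisition rates, caps, and channel names are hypothetical, and real prescriptive models are often richer (diminishing returns, simulation, constraint solvers), but the optimize-and-recommend step is what marks the work as prescriptive rather than predictive.
import numpy as np
from scipy.optimize import linprog

channels = ["social", "email", "ppc"]
acq_per_dollar = np.array([0.012, 0.020, 0.015])  # hypothetical historical rates

total_budget = 100_000
bounds = [(0, 60_000), (0, 30_000), (0, 60_000)]  # illustrative per-channel caps

# linprog minimizes, so negate the objective to maximize expected acquisitions.
result = linprog(
    c=-acq_per_dollar,
    A_ub=[[1, 1, 1]], b_ub=[total_budget],  # spend no more than the total budget
    bounds=bounds,
    method="highs",
)

for name, dollars in zip(channels, result.x):
    print(f"{name}: ${dollars:,.0f}")
print(f"expected new customers: {-result.fun:,.0f}")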
Ask Bash
What are the key differences between prescriptive analytics and predictive analytics?
What kind of algorithms or models might be used in prescriptive analytics?
How does historical data contribute to prescriptive analytics?
A data analyst is preparing a large dataset of digital photographs for a computer vision model. The key requirement is to reduce the overall file size of the dataset to optimize storage and processing speed. A slight, often imperceptible, reduction in image quality is acceptable to achieve this goal. Which file format should the analyst use?
.tiff
.jpg
.bmp
.png
Answer Description
The correct answer is .jpg. JPEG (Joint Photographic Experts Group) files use a lossy compression method, which significantly reduces file size by permanently discarding some image data. This makes the .jpg format ideal for situations where minimizing storage and optimizing load times is critical, and a minor reduction in quality is an acceptable trade-off, especially for photographic images.
- .png: This format uses lossless compression, which preserves all original image data, resulting in higher quality but also larger file sizes compared to .jpg. It is not the best choice when the primary goal is file size reduction.
- .tiff: This format uses lossless compression or no compression at all, which preserves the highest image quality but results in very large file sizes. It is often used for professional printing and archiving, not for optimizing storage space.
- .bmp: This format is typically uncompressed, leading to extremely large file sizes. It is unsuitable for this scenario where file size reduction is a key requirement.
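If the analyst prepares the files in Python, a short Pillow sketch (hypothetical file names) shows the trade-off in practice: saving as JPEG with a quality setting discards some detail to shrink the file, while a lossless PNG of the same photo is typically much larger.
from PIL import Image

photo = Image.open("sample_photo.tiff").convert("RGB")  # hypothetical source image

# Lossy JPEG: 'quality' trades file size against mostly imperceptible detail loss.
photo.save("sample_photo.jpg", format="JPEG", quality=85, optimize=True)

# Lossless PNG of the same image, kept here only for a size comparison.
photo.save("sample_photo.png", format="PNG")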
Ask Bash
Why does .jpg use lossy compression, and how does it affect image quality?
What are some scenarios where .png is preferred over .jpg?
How do .tiff and .bmp formats differ from .jpg in terms of use cases?
A data analyst is assigned to a project involving a large, established relational database. Before writing any queries, the analyst needs to identify the tables available, the columns within each table, and the primary and foreign key relationships that link them together. Which of the following database components provides this formal structural blueprint?
Bridge table
Database schema
Fact table
Data dictionary
Answer Description
The correct answer is the database schema. The schema serves as the blueprint for a relational database, formally defining its structure, including tables, the columns and data types within those tables, and the relationships between them (such as primary and foreign keys). A data dictionary provides descriptive information and business context about the data elements but is not the formal structural blueprint itself. A fact table contains quantitative business metrics and is a component within a schema, not the overall blueprint. A bridge table is a specific type of table used to resolve many-to-many relationships and is also a component within a schema.
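Most engines expose this blueprint programmatically, for example through INFORMATION_SCHEMA views. As a lightweight illustration, the Python sketch below lists the tables, columns, and key relationships of a hypothetical SQLite database using only the standard library.
import sqlite3

con = sqlite3.connect("sales.db")  # hypothetical database file

tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    print(f"-- {table}")
    # Each column row is (cid, name, type, notnull, default, pk).
    for cid, name, col_type, notnull, default, pk in con.execute(f"PRAGMA table_info({table})"):
        print(f"   {name}: {col_type}{' (PK)' if pk else ''}")
    # Each foreign-key row includes the referenced table and the from/to columns.
    for fk in con.execute(f"PRAGMA foreign_key_list({table})"):
        print(f"   FK {fk[3]} -> {fk[2]}({fk[4]})")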
Ask Bash
What is a database schema?
How does a database schema differ from a data dictionary?
What is the role of primary and foreign keys in a database schema?
A team is redesigning a SQL Server customer table that will store names from more than 40 countries, including languages that use Chinese, Arabic, and Cyrillic characters. Names can be up to 200 characters long, but most are under 50 characters. Which SQL string data type BEST satisfies the business requirements for international character support and keeps storage overhead low when the value is shorter than the maximum length?
nvarchar(200)
nchar(200)
varchar(200)
char(200)
Answer Description
The column must store multilingual text, so a Unicode-capable data type is mandatory. In SQL Server, the Unicode options are nchar and nvarchar.
- nchar(200) allocates a fixed 200 characters (400 bytes) for every row, wasting space on shorter names.
- nvarchar(200) is Unicode and variable-length; it stores only the bytes required for each value plus a small length indicator, so short names consume far less space.
- char(200) and varchar(200) are non-Unicode in most collations (unless UTF-8 is explicitly enabled), so they cannot reliably hold Chinese or Arabic characters.
Therefore, nvarchar(200) is the best fit because it supports full Unicode and minimizes storage for values that are typically shorter than the declared maximum.
Ask Bash
What is the difference between char, varchar, nchar, and nvarchar in SQL?
Why is Unicode important for storing multilingual text in SQL?
What are some best practices when choosing string data types for database design?
After today's automated refresh, a retail sales dashboard suddenly shows a Gross Profit Margin of 215%, yet a manual SQL check of the ERP system still returns the expected 20% range. No ETL errors were logged and other visuals appear normal. Before involving the data-engineering team, which review technique should the analyst apply first to efficiently locate the problem inside the report?
Ask a colleague to conduct a peer review focused on the dashboard's visual design and storytelling elements.
Configure an automated alert that emails the analyst whenever the profit-margin card exceeds a predefined threshold.
Examine the report's measures and aggregation formulas to verify that the profit-margin calculation logic is correct.
Perform a detailed line-by-line code review of the overnight ETL and SQL extraction scripts that populate the model.
Answer Description
The most likely cause of an implausible 215% margin is an error in the KPI's formula, such as dividing by the wrong column or applying an incorrect aggregate. A calculation review focuses on auditing the measures, expressions, or DAX used in the report, enabling the analyst to confirm or correct the logic quickly. A peer review that concentrates on layout would not inspect the math, a code review of the entire ETL pipeline is time-consuming and unnecessary when other metrics look fine, and an alert only signals the symptom without revealing the root cause.
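A calculation review often comes down to checking how a ratio KPI is aggregated. The hypothetical Python snippet below shows one common class of bug, summing per-row margin percentages instead of dividing total profit by total revenue, which is exactly the kind of logic error that can push a margin card to an implausible value.
import pandas as pd

sales = pd.DataFrame({
    "revenue": [1_000.0, 2_000.0, 500.0],
    "cogs":    [  800.0, 1_500.0, 450.0],
})
sales["row_margin"] = (sales["revenue"] - sales["cogs"]) / sales["revenue"]

wrong = sales["row_margin"].sum()  # summing percentages inflates the KPI
correct = (sales["revenue"].sum() - sales["cogs"].sum()) / sales["revenue"].sum()

print(f"incorrect measure: {wrong:.1%}")    # 55.0%
print(f"correct measure:   {correct:.1%}")  # 21.4%, the ratio of the sums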
Ask Bash
What does KPI mean in this context?
What is DAX and why is it relevant here?
Why is the ETL process unlikely to be the issue here?
During an initial exploration of a CRM export that will feed a data warehouse, you notice the staging table contains two seemingly identical columns, Customer_ID and CRM_Customer_GUID. To test whether one column is a redundant copy of the other, you execute:
SELECT COUNT(*)
FROM staging.crm_sales
WHERE Customer_ID IS DISTINCT FROM CRM_Customer_GUID;
The query returns a count of 0. Based solely on this result, which conclusion is most appropriate?
The two columns store the same values in every record, so keeping both is redundant and one can be dropped.
Zero mismatches prove the table contains duplicate rows, so both columns are needed to remove those duplicates later.
The test shows that each column is unique but represents a different business key, so no redundancy exists.
Because there are no mismatches, Customer_ID must be a foreign key that references CRM_Customer_GUID, so both columns should remain.
Answer Description
The predicate IS DISTINCT FROM returns TRUE only when the two compared expressions differ, treating NULL as a normal value. Because the count is 0, there is no record in which Customer_ID and CRM_Customer_GUID differ (including no mismatched NULLs). Therefore the two columns carry identical information for every row, so retaining both adds unnecessary storage and maintenance overhead; one of them can be removed. The other options misinterpret the result: a lack of mismatches does not imply duplicate rows, identify different business keys, or reveal a foreign-key relationship. It simply shows the columns are equivalent and hence redundant.
Ask Bash
What does the SQL clause 'IS DISTINCT FROM' do?
Why are redundant columns in a database problematic?
How does this SQL query determine redundancy between two columns?
A data analyst is tasked with creating a deliverable for the sales leadership team to support their quarterly performance review meetings. The leadership team needs to see high-level KPIs but also wants the ability to interactively filter the data during the meeting. For example, they want to drill down into regional performance, compare product sales, and isolate data for specific sales representatives to understand the context behind the numbers. Which of the following delivery methods would BEST meet these requirements?
A static dashboard
An executive summary
A recurring static report
A dynamic dashboard
Answer Description
The correct answer is a dynamic dashboard. The scenario's key requirement is for a deliverable that allows the leadership team to interactively filter and drill down into the data during their meetings. A dynamic dashboard is specifically designed for this purpose, offering interactive elements like slicers, filters, and drill-down capabilities that allow users to explore the data in real time.
A static dashboard is incorrect because, while it can display the high-level KPIs, it presents a fixed, non-interactive snapshot of the data and would not allow for real-time exploration during the meeting. A recurring static report would be delivered on a schedule, such as quarterly, but it would still be a fixed report without the interactive features needed for live analysis. An executive summary is a narrative document that provides a high-level overview and key findings; it is not an interactive data exploration tool.
Ask Bash
What features make a dynamic dashboard interactive?
How does a static dashboard differ from a dynamic dashboard?
Why wouldn't an executive summary meet the requirements in this scenario?
While building a marketing dashboard, you receive newline-delimited JSON files where each record contains a nested "events" list and a "device" dictionary. You need to convert these nested structures into one flat, tabular DataFrame so the results can be loaded directly into a relational table-without writing custom loops or manual parsing code. Which pandas function provides the most straightforward way to perform this flattening step in Python?
pandas.melt()
pandas.pivot_table()
pandas.read_html()
pandas.json_normalize()
Answer Description
pandas.json_normalize is purpose-built to transform semi-structured or deeply nested JSON data into a flat, two-dimensional DataFrame. It automatically traverses dictionaries and lists, creating column headers that reflect the nested path. melt reshapes existing DataFrames from wide to long format, pivot_table summarizes data through aggregation, and read_html imports HTML tables; none of these functions flatten nested JSON. Therefore, json_normalize is the correct choice.
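A short example of the flattening step, with illustrative field names, is shown below; for newline-delimited files the records can first be read with pandas.read_json(path, lines=True) or the json module.
import pandas as pd

records = [
    {
        "user_id": "u-17",
        "device": {"os": "iOS", "model": "15 Pro"},
        "events": [
            {"type": "impression", "ts": "2025-03-01T10:00:00Z"},
            {"type": "click",      "ts": "2025-03-01T10:00:05Z"},
        ],
    }
]

# One row per nested event, carrying selected parent fields along as columns.
flat = pd.json_normalize(
    records,
    record_path="events",
    meta=["user_id", ["device", "os"]],
    sep="_",
)
print(flat)  # columns: type, ts, user_id, device_os -- ready for a relational load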
Ask Bash
What is pandas.json_normalize()?
How does pandas.json_normalize handle nested JSON with lists and dictionaries?
When would you choose pandas.json_normalize over other pandas functions like melt or pivot_table?
A data analyst is preparing a report on annual employee compensation. The dataset includes salaries for all employees, from entry-level positions to C-suite executives. The analyst observes that the salary distribution is heavily skewed to the right because of a few extremely high executive salaries. To communicate the most representative measure of a 'typical' employee's earnings without being distorted by these high values, which of the following mathematical functions should the analyst primarily use?
Median
Standard Deviation
Mode
Mean
Answer Description
The correct answer is the median. In a dataset with a skewed distribution, such as salary data with a few very high earners, the median is the most appropriate measure of central tendency. The median represents the middle value and is not significantly affected by extreme outliers. The mean would be pulled upwards by the high executive salaries, giving a misleadingly high representation of a 'typical' salary. The standard deviation is a measure of data dispersion or spread around the mean, not a measure of central tendency. The mode, which is the most frequent value, is often not suitable for continuous data like salaries and may not provide a meaningful 'typical' value.
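A quick illustration with made-up salaries shows how strongly a few outliers pull the mean while leaving the median untouched:
from statistics import mean, median

# Illustrative data: typical salaries plus two executive outliers.
salaries = [48_000, 52_000, 55_000, 58_000, 61_000, 64_000, 70_000, 950_000, 2_400_000]

print(f"mean:   {mean(salaries):,.0f}")    # about 417,556 -- far above a 'typical' salary
print(f"median: {median(salaries):,.0f}")  # 61,000 -- unaffected by the outliers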
Ask Bash
Why is the median more reliable than the mean in a skewed dataset?
What does it mean for data to be 'skewed'?
When is it appropriate to use the mode over the median or mean for central tendency?
A data analyst is working with a 'ProductNotes' text field in a sales database. This field contains user-entered notes and a product's Stock Keeping Unit (SKU). The SKUs are not in a separate, structured column. The SKU has a consistent format: it always starts with "SKU:", followed by exactly three uppercase letters, a hyphen, and five digits (e.g., 'SKU:ABC-12345'). The goal is to extract only the SKU identifier (e.g., 'ABC-12345') into a new column. Which of the following regular expressions should the analyst use to capture just the SKU identifier?
SKU:[A-Z]{3}-\d{5}
(SKU:[A-Z]{3}-\d{5})
SKU:([A-Z]{3}-\d{5})
SKU:([A-z]{3}-\d{5})
Answer Description
The correct regular expression is SKU:([A-Z]{3}-\d{5}). The parentheses () create a 'capturing group.' This tells the regex engine to match the entire pattern but to specifically capture and return only the portion of the string that matches the pattern inside the parentheses. In this case, it matches 'SKU:' but only captures the [A-Z]{3}-\d{5} part.
- (SKU:[A-Z]{3}-\d{5}) is incorrect because the capturing group includes the 'SKU:' prefix, which the requirement states should be excluded from the final extracted value.
- SKU:[A-Z]{3}-\d{5} is incorrect because it will find and match the pattern, but it lacks a capturing group (), so it will not isolate and extract only the identifier part.
- SKU:([A-z]{3}-\d{5}) is incorrect because the character range [A-z] is a common mistake. In ASCII, this range includes non-alphabetic characters that fall between 'Z' and 'a' (such as '[', '\', ']', '^', '_', and '`'). The requirement specifies only uppercase letters, for which [A-Z] is the correct and precise range.
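In Python, for instance, the capturing group is retrieved with group(1), while group(0) returns the full match:
import re

note = "Customer reported damage on arrival. SKU:ABC-12345 replaced under warranty."

match = re.search(r"SKU:([A-Z]{3}-\d{5})", note)
if match:
    print(match.group(0))  # 'SKU:ABC-12345' -- the entire matched text
    print(match.group(1))  # 'ABC-12345'     -- only the captured SKU identifier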
Ask Bash
What does a 'capturing group' in regex do?
Why is `[A-Z]` preferred over `[A-z]` in regex?
What happens if a regex pattern doesn't have a capturing group?
A data analyst is tasked with creating a new sales performance dashboard that will be distributed to both internal managers and external partners. The company has recently undergone a rebranding and published a strict corporate style guide. To ensure the dashboard is immediately recognizable and aligns with the new corporate identity, which of the following design elements should be the analyst's primary focus?
Select the most visually complex chart types to impress the external partners.
Incorporate the company's logo and official color palette as defined in the style guide.
Use a color-blind-friendly palette with the highest possible contrast.
Optimize the dashboard's data queries to ensure the fastest possible load times.
Answer Description
The correct answer is to incorporate the company's logo and official color palette as defined in the style guide. Adhering to a corporate style guide, especially for materials shared with external partners, is crucial for maintaining brand identity, consistency, and professionalism. The style guide provides the specific elements, such as logos and color palettes, that create a recognizable and trusted look and feel.
- Optimizing data queries is a technical task related to performance, not a visual design element for branding.
- Using a color-blind-friendly palette is a best practice for accessibility, but when a strict corporate style guide is provided, the primary requirement is to follow those branding rules. Ideally, the corporate palette would already be accessible, but the analyst's first duty is brand alignment.
- Selecting complex charts does not address branding; the goal of chart selection should be clarity and accurate data representation, not visual complexity for its own sake.
Ask Bash
What is a corporate style guide, and why is it important for a dashboard design?
Why is it not ideal to focus on visually complex charts for dashboards?
Why is a color-blind-friendly palette not the priority in this case?
During a routine data-profiling exercise on a new data warehouse, an analyst runs a foreign-key validation between the SalesFacts fact table and the ProductDim dimension table. The profiler report shows that 97.8% of the 1.2 million SalesFacts.ProductID values are found in ProductDim.ProductID, leaving 26,769 rows that reference a non-existent product. Which data-quality dimension is this report primarily quantifying?
Uniqueness
Timeliness
Completeness
Consistency
Answer Description
The report is measuring consistency. A key aspect of consistency is referential integrity, which ensures that values meant to link records across different tables are valid. In this case, the profiler is calculating what percentage of foreign-key values (SalesFacts.ProductID) have a matching primary-key value in the related dimension table (ProductDim.ProductID). Completeness would focus on whether mandatory fields are populated, uniqueness would evaluate duplicate records, and timeliness would consider how current the data is. None of those other dimensions rely on cross-table validation.
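The same orphan-row check can be reproduced outside the profiler. Here is a small pandas sketch (toy data) that flags fact rows whose ProductID has no match in the dimension table:
import pandas as pd

sales_facts = pd.DataFrame({"SaleID": [1, 2, 3, 4], "ProductID": [10, 11, 99, 10]})
product_dim = pd.DataFrame({"ProductID": [10, 11, 12]})

# Left-join the facts to the dimension and keep only the rows with no match.
checked = sales_facts.merge(product_dim, on="ProductID", how="left", indicator=True)
orphans = checked[checked["_merge"] == "left_only"]

match_rate = 1 - len(orphans) / len(sales_facts)
print(f"referential consistency: {match_rate:.1%} ({len(orphans)} orphan row(s))")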
Ask Bash
What is referential integrity in databases?
How does consistency differ from other data-quality dimensions like completeness or uniqueness?
What tools or methods can analysts use to validate foreign-key consistency?
During the weekly data-load process, a junior data analyst runs a SQL view that casts the column quantity_sold to INT. This week the script fails and returns the runtime error:
Conversion failed when converting the varchar value 'N/A' to data type int.
The schema of the staging and target tables has not changed since the previous successful load. Which action should the analyst take first to troubleshoot the issue and prevent it from happening in future loads?
Increase the database server's memory allocation so the CAST operation can complete in memory.
Validate the source file and cleanse any non-numeric values in quantity_sold before loading the staging table.
Enable detailed query-plan logging on the database server to capture the statement's execution plan.
Rewrite the view to use a FULL OUTER JOIN instead of an INNER JOIN to eliminate rows with nulls.
Answer Description
The error indicates that at least one row in quantity_sold contains a non-numeric string ("N/A"), so SQL Server cannot convert the value to an integer. According to data-validation best practices, the analyst should verify and cleanse the source data before it is loaded or cast. By validating the incoming extract and filtering or correcting non-numeric values, the analyst removes the root cause of the conversion failure and prevents the error from recurring.
- Enabling detailed query logging would show the failing statement but would not fix the data quality problem.
- Increasing server memory does not address the data-type mismatch.
- Rewriting the view with a different join type does not change the fact that quantity_sold contains invalid characters.
Therefore, validating and cleansing the source data is the most appropriate first step.
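If the validation step is scripted in Python before the staging load, pandas.to_numeric with errors="coerce" is one straightforward way to surface and handle the offending values (illustrative data below); a comparable guard inside SQL Server would be a TRY_CAST check.
import pandas as pd

staging = pd.DataFrame({"quantity_sold": ["12", "7", "N/A", "30", ""]})

# Coerce anything non-numeric to NaN instead of letting the CAST fail downstream.
qty = pd.to_numeric(staging["quantity_sold"], errors="coerce")

bad_rows = staging[qty.isna()]
print(f"{len(bad_rows)} row(s) to correct or exclude before loading:")
print(bad_rows)

staging["quantity_sold"] = qty                     # numeric column, ready for the INT target
clean = staging.dropna(subset=["quantity_sold"])   # or route bad rows to a rejects table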
Ask Bash
What does 'casting' mean in SQL, and why is it important?
How can the cleansing of source data prevent runtime errors?
What tools or methods can be used to validate and cleanse source data?
After discovering that an internal spreadsheet containing employee Social Security numbers and salary data was accidentally shared with all staff on a corporate messaging platform, you remove the file within 10 minutes. The compliance group asks you to complete the initial data-breach incident report so the data-protection officer (DPO) can decide whether outside regulators must be notified. Which one of the following details is most critical to include in that first report?
A complete root-cause analysis that includes system patch-management logs
A brief description of the personal data exposed and an approximate count of the records affected
The marketing team's revision history for the spreadsheet
A cost-benefit analysis of notifying versus not notifying affected employees
Answer Description
Regulators typically need to know the nature and scope of the breach before anything else. Article 33(3)(a) of the GDPR, as well as comparable U.S. state breach-notification statutes and FTC guidance, explicitly require that a breach report describe the categories of personal data involved and the approximate number of records or individuals affected. Without this information, the DPO cannot assess the level of risk or determine whether external notification is legally required.
- The correct option provides the mandated information (type of data and record count).
- The other options may be useful later in an investigation (root-cause analysis, cost estimates, marketing details) but are not prerequisites for determining regulatory reporting obligations at the initial stage.
Ask Bash
What is Article 33(3)(a) of the GDPR?
What defines 'personal data' in data-breach regulations like GDPR?
Why is the approximate number of records affected important in a breach report?