A data scientist is developing a customer churn prediction model. The source data is a long-format transactional table with the columns customer_id, event_timestamp, and event_type (e.g., 'page_view', 'add_to_cart', 'purchase'). For the model, the data scientist needs to create a feature matrix where each row represents a single customer_id, and the columns represent the total count of each unique event_type for that customer.
Which data transformation technique should the data scientist apply to reshape the data into this required wide-format feature matrix?
Pivoting the table with customer_id as the index, event_type as the columns, and a count aggregation function.
Applying one-hot encoding to the event_type column.
Binning the event_timestamp column into time-based categories.
Normalizing the customer_id and event_type columns.
The correct answer is to pivot the data. Pivoting is a data transformation technique used to reshape a dataframe from a long format to a wide format. In this scenario, the data scientist would set customer_id as the index, the unique values from the event_type column as the new column headers, and use an aggregation function, such as count, to populate the values of these new columns. This process directly produces the desired feature matrix where each row is a unique customer and each column is the count of a specific event type.
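As a minimal sketch of the pivot described above, assuming the data lives in a pandas DataFrame: the sample rows and the variable names `events` and `features` are made up for illustration, and only the column names come from the question. `event_timestamp` is used as the value column to count because every event has one.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "event_timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:05", "2024-01-02 09:00",
        "2024-01-01 11:00", "2024-01-03 12:00", "2024-01-02 08:30",
    ]),
    "event_type": ["page_view", "page_view", "purchase",
                   "page_view", "add_to_cart", "page_view"],
})

# Long to wide: one row per customer, one column per event type.
# Counting event_timestamp counts the events in each (customer, type) group;
# fill_value=0 covers event types a customer never triggered.
features = events.pivot_table(
    index="customer_id",
    columns="event_type",
    values="event_timestamp",
    aggfunc="count",
    fill_value=0,
)

print(features)
```

In this sample, customer 1 ends up with 2 page views, 1 purchase, and 0 add-to-cart events, all in a single row.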
Applying one-hot encoding to the event_type column is incorrect because, while it does create new binary columns for each category, it does so for every row in the original long-format table. It does not perform the necessary aggregation to consolidate the data into a single row per customer.
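The difference can be seen in a small sketch, again assuming pandas and made-up sample rows. One-hot encoding keeps one row per event; the counts only appear after a separate group-and-sum step, which is the aggregation that encoding alone does not do.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_type": ["page_view", "purchase", "page_view"],
})

# One-hot encoding adds a binary column per event type,
# but the table still has one row per event, not per customer.
one_hot = pd.get_dummies(events, columns=["event_type"], dtype=int)
print(len(one_hot))  # same number of rows as the long-format input

# Only an extra aggregation step collapses the rows to one per customer.
aggregated = one_hot.groupby("customer_id").sum()
print(aggregated)
```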
Binning the event_timestamp column is incorrect because binning converts continuous values, such as timestamps, into discrete categorical bins. It changes the values within a column but does not reshape the structure of the dataset from long to wide.
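For illustration, a hedged sketch of what binning the timestamp would look like in pandas. The bin edges, the labels, and the `time_of_day` column name are assumptions chosen for this example. The table keeps its long shape; it only gains a categorical column.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime([
        "2024-01-01 08:15", "2024-01-01 14:40", "2024-01-01 21:05",
    ]),
})

# Bin the hour of day into four assumed categories.
# right=False makes each bin include its left edge, e.g. [6, 12).
events["time_of_day"] = pd.cut(
    events["event_timestamp"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)

# Still one row per event: binning did not reshape anything.
print(events)
```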
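Normalizing is incorrect because it is a feature scaling technique used to adjust the range of numerical features to a common scale. Here, customer_id is an identifier and event_type is categorical, so neither is a numerical feature to scale, and scaling is irrelevant to the task of structurally reshaping the dataframe.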