Databricks Exam Databricks-Machine-Learning-Associate Topic 3 Question 22 Discussion

Actual exam question for Databricks's Databricks-Machine-Learning-Associate exam

Question #: 22
Topic #: 3


A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

A. One-hot encoding categorical features

B. Target encoding categorical features

C. Imputing missing feature values with the mean

D. Imputing missing feature values with the true median

E. Creating binary indicator features for missing values

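Suggested Answer: D

Computing the true (exact) median requires a global ordering of every value in the column, so the data must be sorted or repeatedly shuffled across the cluster rather than summarized independently on each partition. This makes it the least efficient of the listed tasks to distribute.

The other options are easier to parallelize. A mean can be computed from per-partition sums and counts, target encoding reduces to per-category aggregates, and one-hot encoding and binary missing-value indicators are row-wise transformations once the set of categories is known. In Spark, an approximate median (for example via DataFrame.approxQuantile) is the usual scalable substitute for the exact one.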
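For the question above, the difference between distributing a mean and a true median can be illustrated with a minimal sketch in plain Python. The partitions list is a made-up stand-in for one numeric feature spread across workers, and nothing here uses Spark itself; it only shows that per-partition results combine cleanly for the mean but not for the median.

```python
from statistics import median

# Hypothetical data: one numeric feature split across three "workers".
partitions = [[1.0, 2.0, 100.0], [3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
all_values = [v for part in partitions for v in part]

# Mean: each partition reports only (sum, count); the driver combines them.
partials = [(sum(part), len(part)) for part in partitions]
total, count = map(sum, zip(*partials))
distributed_mean = total / count
global_mean = sum(all_values) / len(all_values)

# Median: combining per-partition medians does not recover the true median,
# which depends on the global ordering of all values.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print("mean from partials:", distributed_mean, "| global mean:", global_mean)
print("median of medians:", median_of_medians, "| true median:", true_median)
```

Because the exact median needs the full sorted order, distributed engines typically fall back on an approximation; in PySpark, DataFrame.approxQuantile is one such option, trading a configurable relative error for far less data movement.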

by Sharen at Dec 06, 2024, 09:11 PM

24 days ago

I think target encoding categorical features will be the least efficient to distribute.

upvoted 0 times

...