Which of the following statements about broadcast variables is correct?
Broadcast variables are local to the worker node and not shared across the cluster.
This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to the worker node, but available to all worker nodes.
Broadcast variables are commonly used for tables that do not fit into memory.
This is wrong because broadcast variables can only be broadcast precisely because they are small and do fit into memory. Tables that do not fit into memory cannot be broadcast.
Broadcast variables are serialized with every single task.
This is wrong because they are cached on every machine in the cluster, which is precisely what avoids them having to be serialized with every single task.
Broadcast variables are occasionally dynamically updated on a per-task basis.
This is wrong because broadcast variables are immutable -- they are never updated.
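For illustration (this sketch is not part of the original question, and the lookup dictionary and variable names are made up), a small read-only lookup table can be broadcast in PySpark like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The broadcast value is serialized once and cached on every executor,
# instead of being shipped with every single task. It is read-only.
stateNames = spark.sparkContext.broadcast({"CA": "California", "NY": "New York"})

# Tasks read the shared value via .value
codes = spark.sparkContext.parallelize(["CA", "NY", "CA"])
names = codes.map(lambda c: stateNames.value[c]).collect()  # ['California', 'New York', 'California']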
More info: Spark -- The Definitive Guide, Chapter 14
Which of the following describes a shuffle?
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect. A shuffle redistributes data across partitions; allocating partitions to executors is the job of Spark's scheduler, not of a shuffle.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small in size (<= 10 MB by default). Broadcast hash joins can avoid shuffles because
instead of exchanging partitions between executors, they broadcast a small table to all executors that then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No. In a shuffle, data is compared across partitions, not across executors.
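As a sketch (the DataFrames here are made up for illustration), the contrast between a join that may shuffle and a broadcast hash join looks like this in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

largeDf = spark.range(1000000).withColumnRenamed("id", "itemId")
smallDf = spark.createDataFrame([(0, "blue"), (1, "red")], ["itemId", "attribute"])

# A regular join may trigger a shuffle: partitions are exchanged between
# executors so that rows with the same join key end up in the same partition.
joinedDf = largeDf.join(smallDf, "itemId")

# A broadcast hash join instead sends the small table to every executor,
# so the join can be performed locally without exchanging partitions.
broadcastJoinedDf = largeDf.join(broadcast(smallDf), "itemId")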
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)
The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
A sample of DataFrame itemsDf is below.
Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")
The correct code block looks like this:
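Reconstructing it from the explanation below (the exact answer option is not reproduced here, so treat this as a sketch), the corrected code would be along these lines:
from pyspark.sql.functions import explode

itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))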
Then, the first couple of rows of itemsAttributesDf look like this:
explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
This is correct.
The split() method should be used inside the select() method instead of the explode() method.
No, the split() method is used to split strings into parts. However, column attributes is an array of strings. In this case, the explode() method is appropriate.
Since itemId is the index, it does not need to be an argument to the select() method.
No, itemId still needs to be selected, whether it is used as an index or not.
The explode() method expects a Column object rather than a string.
No, a string works just fine here. This being said, there are some valid alternatives to passing in a string, as sketched below.
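For example (assuming itemsDf has the attributes column described above), each of these is an equivalent way to reference the column:
from pyspark.sql.functions import col, explode

itemsDf.select("itemId", explode(col("attributes")).alias("attribute"))
itemsDf.select("itemId", explode(itemsDf.attributes).alias("attribute"))
itemsDf.select("itemId", explode(itemsDf["attributes"]).alias("attribute"))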
The alias() method needs to be called after the select() method.
No. The alias() method is called on the exploded column inside the select() method, not on the DataFrame after select().
More info: pyspark.sql.functions.explode --- PySpark 3.1.1 documentation (https://bit.ly/2QUZI1J)
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned
DataFrame?
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows:
DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument withReplacement specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row to be able to appear more than once. So, we need to pass True for this argument.
About replacement: 'Replacement' is easiest explained with the example of removing random items from a box. When you remove items 'with replacement', you put each item back into the box after taking it out. So, if you randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more times. 'Without replacement' means that you do not put an item back into the box after removing it. So, every time you remove an item, there is one less item in the box and you can never take the same item twice.
The second argument to the sample() method is fraction. This refers to the fraction of items that should be returned. In the question, we are asked for 150 out of 1000 items -- a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable. This means that if you re-run the same sample() operation with the same random seed, you get the same rows returned from the sample() command. The question does not require any particular behavior from the random seed. The varying random seeds in the answer options are only there to confuse you!
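Putting this together (the exact answer options are not reproduced here, so this is a sketch), a matching call would look like the following; any seed value, for example seed=42, is acceptable:
transactionsDf.sample(withReplacement=True, fraction=0.15, seed=42)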
More info: pyspark.sql.DataFrame.sample --- PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1, Question 49 (Databricks import instructions)