Which of the following statements about executors is correct?
Executors stop upon application completion by default.
Correct. Executors persist only for the lifetime of an application.
A notable exception is when Dynamic Resource Allocation is enabled (it is disabled by default). With Dynamic Resource Allocation, executors are terminated when they are idle, even while the application is still running.
An executor can serve multiple applications.
Wrong. An executor is always specific to one application. It is terminated when that application completes (see the exception above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
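To make the Dynamic Resource Allocation exception above concrete, here is a minimal sketch of enabling it when building a session. The configuration keys are standard Spark properties; the application name and the timeout value are illustrative choices, not taken from the question.
from pyspark.sql import SparkSession

# Minimal sketch: enable Dynamic Resource Allocation so that idle
# executors are released while the application is still running.
spark = (SparkSession.builder
    .appName("dynamic-allocation-demo")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate())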
More info: Job Scheduling - Spark 3.1.2 Documentation; How Applications are Executed on a Spark Cluster (Anatomy of a Spark Application, InformIT); Spark Jargon for Starters (Mageswaran D, Medium)
Which of the following statements about broadcast variables is correct?
Broadcast variables are local to the worker node and not shared across the cluster.
This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to the worker node, but available to all worker nodes.
Broadcast variables are commonly used for tables that do not fit into memory.
This is wrong because broadcast variables can only be broadcast precisely because they are small and do fit into memory.
Broadcast variables are serialized with every single task.
This is wrong because they are cached on every machine in the cluster, precisely so that they do not have to be serialized with every single task.
Broadcast variables are occasionally dynamically updated on a per-task basis.
This is wrong because broadcast variables are immutable -- they are never updated.
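To illustrate the points above, here is a minimal sketch of creating and reading a broadcast variable; the lookup table contents are made up for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()  # hypothetical app name

# The small lookup table is cached once per machine in the cluster,
# instead of being serialized with every single task.
countries = spark.sparkContext.broadcast({"DE": "Germany", "FR": "France"})

# Tasks read the cached value via .value; broadcast variables are immutable.
print(countries.value["DE"])  # Germany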
More info: Spark -- The Definitive Guide, Chapter 14
Which of the following describes a shuffle?
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect. A shuffle redistributes data across partitions; assigning partitions (via tasks) to executors is the job of Spark's scheduler, not of the shuffle itself.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits when at least one of the two tables is small (<= 10 MB by default). Broadcast hash joins can avoid shuffles because, instead of exchanging partitions between executors, they broadcast the small table to all executors, which then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No. In a shuffle, data is compared across partitions, not across executors.
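To illustrate the broadcast hash join mentioned above, here is a minimal sketch; the DataFrames and column names are made up for the example.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()  # hypothetical app name

# Hypothetical data: a larger fact table and a small dimension table.
largeDf = spark.createDataFrame([(1, 10.0), (2, 20.0), (1, 5.0)], ["key", "amount"])
smallDf = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "label"])

# The broadcast() hint ships smallDf to every executor, so the join is
# performed locally on largeDf's partitions and no shuffle is needed.
joined = largeDf.join(broadcast(smallDf), on="key", how="inner")
joined.explain()  # the physical plan should show a BroadcastHashJoin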
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)
The code block displayed below contains an error. The code block should create a DataFrame itemsAttributesDf that has the columns itemId and attribute, listing every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")
The correct code block looks like this:
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))
explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
This is correct.
The split() method should be used inside the select() method instead of the explode() method.
No, the split() method is used to split strings into parts. However, the attributes column is an array of strings. In this case, the explode() method is appropriate.
Since itemId is the index, it does not need to be an argument to the select() method.
No, itemId still needs to be selected, whether it is used as an index or not.
The explode() method expects a Column object rather than a string.
No, a string works just fine here. That said, there are valid alternatives to passing in a string: explode() also accepts a Column object, for example explode(col("attributes")) or explode(itemsDf.attributes).
The alias() method needs to be called after the select() method.
No. alias() is called on the exploded column inside select() to name the resulting column; it does not need to come after select().
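For a self-contained illustration, here is a runnable sketch with made-up sample rows standing in for itemsDf:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-demo").getOrCreate()  # hypothetical app name

# Hypothetical stand-in for itemsDf: an itemId plus an array column.
itemsDf = spark.createDataFrame(
    [(1, ["blue", "winter", "cozy"]), (2, ["red", "summer"])],
    ["itemId", "attributes"],
)

# explode() comes from pyspark.sql.functions and is used inside select(),
# producing one output row per array element.
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))
itemsAttributesDf.show()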
More info: pyspark.sql.functions.explode --- PySpark 3.1.1 documentation (https://bit.ly/2QUZI1J)