Latest free exam questions for the Databricks Certification Associate-Developer-Apache-Spark-3.5:

1. A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.
Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)

A) It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.
B) It processes data solely on disk storage, reducing the need for memory resources.
C) It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.
D) It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.
E) It eliminates the need to write any code, automatically handling all data processing.

2. A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.
Which save mode and method should be used?

A) save with mode Ignore
B) saveAsTable with mode Overwrite
C) saveAsTable with mode ErrorIfExists
D) save with mode ErrorIfExists

3. A data engineer needs to join multiple DataFrames and has written the following code:
from pyspark.sql.functions import broadcast
data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]
df1 = spark.createDataFrame(data1, ["id", "val1"])
df2 = spark.createDataFrame(data2, ["id", "val2"])
df3 = spark.createDataFrame(data3, ["id", "val3"])
df_joined = df1.join(broadcast(df2), "id", "inner") \
    .join(broadcast(df3), "id", "inner")
What will be the output of this code?

A) The code will fail because only one broadcast join can be performed at a time.
B) The code will result in an error because broadcast() must be called before the joins, not inline.
C) The code will work correctly and perform two broadcast joins simultaneously to join df1 with df2, and then the result with df3.
D) The code will fail because the second join condition (df2.id == df3.id) is incorrect.

4. A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.
The original code:
from pyspark.sql import functions as F
min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))
Which code block should the developer use to refactor the code?

A) result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
B) result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()
C) result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()
D) result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

5. A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?

A) Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy
B) Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges
C) Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges
D) Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy

Questions and answers:

Question #1
Answer: A, C

Question #2
Answer: C

Question #3
Answer: C
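
To make the chained joins in Question #3 concrete, here is a plain-Python toy model, not Spark's implementation. It assumes a broadcast join behaves like a hash join where the small side is held in memory as a lookup table; the helper name `hash_join` is hypothetical. The data are the three lists from the question.

```python
# Toy model only: plain Python, no Spark involved.
data1 = [(1, "A"), (2, "B")]
data2 = [(1, "X"), (2, "Y")]
data3 = [(1, "M"), (2, "N")]


def hash_join(left_rows, small_rows):
    """Inner-join on the first field, using small_rows as an in-memory lookup."""
    lookup = {}
    for key, *rest in small_rows:
        # Keep every match per key so duplicate keys would still join correctly.
        lookup.setdefault(key, []).append(tuple(rest))

    joined = []
    for row in left_rows:
        for match in lookup.get(row[0], []):
            joined.append(row + match)
    return joined


# First join data1 with data2, then join that result with data3.
step1 = hash_join(data1, data2)
result = hash_join(step1, data3)
print(result)
```

In this sketch the second join consumes the output of the first, and each small table is used independently as a lookup, so nothing prevents two such joins from appearing in the same chain.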

Question #4
Answer: A

Question #5
Answer: D
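
For Question #5, Spark's documentation describes the `accuracy` parameter of approx_percentile (default 10000) as trading memory for precision, with 1.0/accuracy as the relative error of the approximation. The arithmetic below is a rough sketch of what that means, assuming the relative error is read as a rank error (the way the related `DataFrame.approxQuantile` documentation defines it). The helper name `max_rank_error` and the row count are made up for illustration, and no Spark call is involved.

```python
def max_rank_error(row_count: int, accuracy: int = 10000) -> float:
    """Rough bound on how many rank positions the result may be off,
    assuming a relative error of 1.0 / accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return row_count / accuracy


rows = 1_000_000  # hypothetical table size
for accuracy in (100, 10_000, 1_000_000):
    # A larger accuracy value gives a smaller allowed rank error.
    print(accuracy, max_rank_error(rows, accuracy))
```

Under these assumptions, raising `accuracy` shrinks the allowed error, which is why option D is the one that addresses drifting results, at the cost of more memory.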