🧪 Databricks MCQ Quiz Hub
Databricks MCQ Question Set 1
Choose a topic to test your knowledge and improve your Databricks skills
1. Which one of the following is not an operation that can be performed using Azure Databricks?
It is an Apache Spark-based analytics platform
It helps to extract, transform and load the data
Visualization of data is not possible with it
All of the above
2. To which one of the following sources does Azure Databricks connect for collecting streaming data?
Kafka
Azure data lake
CosmosDB
none of the above
3. Which one of the following is a Databricks concept?
Workspace
Authentication and authorization
Data Management
All of the above
4. Which of the following ensures data reliability even after the termination of a cluster in Azure Databricks?
Databricks Runtime
Databricks File System
Dashboards
Workspace
5. Which of the following is correct with respect to ETL operations on data in Azure Databricks?
For loading of data, data is moved from Databricks to the data warehouse
For loading of data, Blob storage is used
Blob storage serves as temporary storage
All of the above
6. Which one of the following is incorrect regarding the Workspace concept of Azure Databricks?
It manages ETL operations of data
It can store notebooks, libraries and dashboards
It is the root folder of Azure Databricks
none of the above
7. Which of the following Azure data sources can be connected to Azure Databricks?
Azure Blob Storage
Azure Data Warehouse
Azure CosmosDB
All of the above
8. Streaming data can be captured by:
Kafka
Event Hubs
Both A and B
none of the above
9. Authentication and authorization in Databricks can be managed for:
User, Group, Access Control List
User, Group
Access Control List
Group, Access Control List
10. Which one of the following is a set of components that run on clusters of Azure Databricks?
DataBricks File System
Databricks Runtime
CosmosDB
Azure Data Lake
11. Spark was initially started by ______ at UC Berkeley AMPLab in 2009.
Mahek Zaharia
Matei Zaharia
Doug Cutting
Stonebraker
12. ______ is a component on top of Spark Core.
Spark Streaming
Spark SQL
RDDs
All of the Mentioned
13. Spark SQL provides a domain-specific language to manipulate ___________ in Scala, Java, or Python.
Spark Streaming
Spark SQL
RDDs
All of the Mentioned
14. _______ leverages Spark Core fast scheduling capability to perform streaming analytics.
MLlib
Spark Streaming
GraphX
RDDs
15. ____ is a distributed machine learning framework on top of Spark.
MLlib
Spark Streaming
GraphX
RDDs
16. Given a dataframe df, select the code that returns its number of rows:
df.take('all')
df.collect()
df.count()
df.numRows()
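A plain-Python analogue may help distinguish the calls in question 16 (df.count, df.collect and df.take are PySpark's; the list-based versions below are illustrative only):

```python
# Plain-Python sketch of the three DataFrame calls, over a list of rows.
rows = [{"id": 1}, {"id": 2}, {"id": 3}]

count = len(rows)        # df.count()   -> the number of rows as an int
collected = list(rows)   # df.collect() -> all rows as a local list
first_two = rows[:2]     # df.take(2)   -> the first n rows
```

Note that `df.numRows()` does not exist in the DataFrame API, and `take` expects a row count, not a string.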
17. Users can easily run Spark on top of Amazon’s _____
Infosphere
EC2
EMR
None of the mentioned
18. Which of the following can be used to launch Spark jobs inside MapReduce?
SIM
SIMR
SIR
RIS
19. Which of the following languages is not supported by Spark?
Java
Pascal
Scala
Python
20. Spark is packaged with higher level libraries, including support for _________ queries.
SQL
C
C++
None of the mentioned
21. Spark includes a collection of over ________ operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
50
60
70
80
22. Given a DataFrame df that includes a number of columns among which a column named quantity and a column named price, complete the code below such that it will create a DataFrame including all the original columns and a new column revenue defined as quantity*price:
df.withColumnRenamed("revenue", expr("quantity*price"))
df.withColumn(revenue, expr("quantity*price"))
df.withColumn("revenue", expr("quantity*price"))
df.withColumn(expr("quantity*price"), "revenue")
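What `withColumn("revenue", expr("quantity*price"))` computes can be sketched in plain Python: a new row set carrying every original column plus the derived one, with the originals untouched (mirroring DataFrame immutability). The `with_column` helper below is illustrative, not PySpark API:

```python
# Hypothetical helper mimicking DataFrame.withColumn over a list of dicts.
def with_column(rows, name, fn):
    # Return new rows with the extra column; the input rows are not mutated.
    return [{**row, name: fn(row)} for row in rows]

rows = [{"quantity": 2, "price": 3.0}, {"quantity": 5, "price": 1.5}]
with_rev = with_column(rows, "revenue", lambda r: r["quantity"] * r["price"])
# with_rev[0] == {"quantity": 2, "price": 3.0, "revenue": 6.0}
```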
23. Spark is engineered from the bottom-up for performance, running ______ faster than Hadoop by exploiting in-memory computing and other optimizations.
100x
150x
200x
None of the mentioned
24. Spark powers a stack of high-level tools including Spark SQL, MLlib for _____
regression models
statistics
machine learning
reproductive research
25. For a multiclass classification problem, which algorithm is not a solution?
Naive Bayes
Random Forests
Logistic Regression
Decision Trees
26. Which of the following is a tool of Machine Learning Library?
Persistence
Utilities like linear algebra, statistics
Pipelines
All of the above.
27. Which of the following is true for Spark core?
It is the kernel of Spark
It enables users to run SQL / HQL queries on the top of Spark.
It is the scalable machine learning library which delivers efficiencies
Improves the performance of iterative algorithm drastically.
28. Given a DataFrame df that has some null values in the column created_date, select the code below that will sort rows in ascending order based on the column created_date, with null values appearing last.
orderBy(asc_nulls_last("created_date"))
sort(asc_nulls_last("created_date"))
orderBy(col("created_date").asc_nulls_last())
orderBy(col("created_date"), ascending=True)
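The nulls-last ascending ordering of question 28 can be sketched in plain Python: sort ascending, but push `None` values to the end by keying on an (is_null, value) pair:

```python
# Plain-Python sketch of asc_nulls_last ordering (illustrative, not PySpark).
rows = [
    {"created_date": "2021-03-01"},
    {"created_date": None},
    {"created_date": "2020-12-31"},
]
# False sorts before True, so non-null dates come first, ascending;
# the null row is pushed to the end.
ordered = sorted(rows, key=lambda r: (r["created_date"] is None,
                                      r["created_date"] or ""))
```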
29. Which of the following is true for Spark MLlib?
Provides an execution platform for all the Spark applications
It is the scalable machine learning library which delivers efficiencies
enables powerful interactive and data analytics application across live streaming data
All of the above
30. Which of the following is true for RDD?
We can operate Spark RDDs in parallel with a low-level API
RDDs are similar to the table in a relational database
It allows processing of a large amount of structured data
It has built-in optimization engine
31. RDD is fault-tolerant and immutable
True
False
Both
none of the mentioned
32. The read operation on RDD is
Fine-grained
Coarse-grained
Either fine-grained or coarse-grained
Neither fine-grained nor coarse-grained
33. The write operation on RDD is
Fine-grained
Coarse-grained
Either fine-grained or coarse-grained
Neither fine-grained nor coarse-grained
34. Which one of the following commands does NOT trigger an eager evaluation?
df.collect()
df.take()
df.show()
df.join()
35. Which one of the following commands triggers an eager evaluation?
df.filter()
df.select()
df.show()
df.limit()
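The lazy-vs-eager distinction behind questions 34 and 35: transformations such as filter, select, join and limit only build up a query plan, while actions such as show, collect and take force execution. Python generators give a rough analogue of this laziness (the code below is illustrative, not PySpark):

```python
# Building a generator pipeline runs nothing; consuming it does.
calls = []

def source():
    for x in range(5):
        calls.append(x)      # record that a row was actually produced
        yield x

# "Transformation": the pipeline is defined but no rows flow yet.
pipeline = (x * 2 for x in source() if x % 2 == 0)
before = list(calls)         # still empty: nothing has executed

# "Action": forces evaluation of the whole pipeline.
result = list(pipeline)
```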
36. Is it possible to mitigate stragglers in RDD?
Yes
No
Both
None of the mentioned
37. Fault Tolerance in RDD is achieved using
Immutable nature of RDD
DAG (Directed Acyclic Graph)
Lazy-evaluation
none of the above
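The DAG answer in question 37 rests on lineage: an RDD partition is not replicated for safety; Spark records the chain of transformations that produced it, so a lost partition can simply be recomputed. A toy plain-Python sketch (illustrative only):

```python
# Lineage-based recovery: no backup copy, just a recorded recipe.
parent = [1, 2, 3, 4]

def lineage():
    # The recorded transformation that derived the partition from its parent.
    return [x * 10 for x in parent]

partition = lineage()    # computed once
partition = None         # simulate losing the partition on a failed node
partition = lineage()    # recover by replaying the lineage
```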
38. What is action in Spark RDD?
A way to send results from the executors to the driver
Takes RDD as input and produces one or more RDD as output.
Creates one or many new RDDs
All of the above
39. The shortcomings of Hadoop MapReduce were overcome by Spark RDD through
Lazy-evaluation
DAG
In-memory processing
All of the above
40. Spark is developed in which language?
Java
Scala
Python
R
41. Which of the following is NOT an action?
foreach()
printSchema()
first()
reduce()
42. Which of the following is an action?
foreach()
printSchema()
cache()
sort()
43. Which of the following is a transformation?
foreach()
flatMap()
save()
count()
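For question 43, flatMap is a transformation in which each input row may produce zero or more output rows, while foreach, save and count are actions. A plain-Python sketch of the flatMap semantics (the `flat_map` helper is hypothetical, not Spark API):

```python
# Each input element expands into any number of output elements.
def flat_map(fn, rows):
    for row in rows:
        yield from fn(row)

lines = ["hello world", "spark rdd api"]
words = list(flat_map(str.split, lines))
# words: ["hello", "world", "spark", "rdd", "api"]
```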
44. Which of the following is not a component of the Spark Ecosystem?
Sqoop
GraphX
MLlib
BlinkDB
45. Which of the following algorithms is not present in MLlib?
Streaming Linear Regression
Streaming KMeans
Tanimoto distance
none of the above
46. Which of the following is not a feature of Spark?
Supports in-memory computation
Fault-tolerance
It is cost-efficient
Compatible with other file storage system
47. Which of the following is the reason for Spark being faster than MapReduce?
DAG execution engine and in-memory computation
Support for different language APIs like Scala, Java, Python and R
RDDs are immutable and fault-tolerant
none of the above
48. Which of the following statements are NOT true for broadcast variables?
Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.
A custom broadcast class can be defined by extending org.apache.spark.util.BroadcastV2 in Java or Scala, or pyspark.AccumulatorParams in Python.
It is a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way.
It provides a mutable variable that a Spark cluster can safely update on a per-row basis.
49. Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task.
True
False
Can’t Specify
None of the mentioned
50. Broadcast variables are ______ and lazily replicated across all nodes in the cluster when an action is triggered
mutable
immutable
both
None of above
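The broadcast-variable idea from questions 48 to 50 can be sketched in plain Python: a read-only lookup table shipped to every worker once, instead of being re-serialized with each task (names below are illustrative, not the PySpark API):

```python
# The "broadcast" value: shared by all tasks, never mutated by any of them.
lookup = {"apple": 1, "banana": 2}

def run_task(partition):
    # Every task reads the same cached copy of lookup.
    return [lookup.get(word, 0) for word in partition]

results = [run_task(p) for p in [["apple"], ["banana", "kiwi"]]]
# results: [[1], [2, 0]]
```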