

Having interviewed with over 100 companies, including top-tier tech firms, one thing has become crystal clear: the Spark round is crucial when it comes to landing your dream role at companies like Walmart, Amazon, Meesho, Morgan Stanley, JPMorganChase, Apple, Google, Mastercard, American Express, and Deutsche Bank.

๐๐š๐ฌ๐ž๐ ๐จ๐ง ๐ฆ๐ฒ ๐ž๐ฑ๐ฉ๐ž๐ซ๐ข๐ž๐ง๐œ๐ž, ๐ก๐ž๐ซ๐ž ๐š๐ซ๐ž ๐ญ๐ก๐ž ๐“๐จ๐ฉ 10 ๐“๐จ๐ฎ๐ ๐ก๐ž๐ฌ๐ญ ๐’๐ฉ๐š๐ซ๐ค & ๐’๐ฉ๐š๐ซ๐ค ๐ƒ๐š๐ญ๐š๐…๐ซ๐š๐ฆ๐ž ๐Ž๐ฉ๐ญ๐ข๐ฆ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง ๐๐ฎ๐ž๐ฌ๐ญ๐ข๐จ๐ง๐ฌ ๐ญ๐ก๐š๐ญ ๐ฒ๐จ๐ฎ ๐œ๐š๐ง ๐ž๐ฑ๐ฉ๐ž๐œ๐ญ ๐๐ฎ๐ซ๐ข๐ง๐  ๐ฒ๐จ๐ฎ๐ซ ๐ข๐ง๐ญ๐ž๐ซ๐ฏ๐ข๐ž๐ฐ

📌 How would you optimize a Spark job that processes billions of records, but it's running slower than expected? What steps would you take to identify the bottleneck?
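A common first check is whether `spark.sql.shuffle.partitions` (default 200) matches the data volume, alongside reading `df.explain()` output and the Spark UI for spilling or straggler tasks. A hypothetical sizing helper, as a sketch (the ~128 MB-per-partition target is a rule of thumb, not a Spark constant):

```python
def suggest_shuffle_partitions(input_bytes: int, target_partition_mb: int = 128) -> int:
    """Suggest a shuffle partition count so each task handles ~target_partition_mb.

    Hypothetical rule of thumb: one partition per ~128 MB of shuffled data.
    """
    target_bytes = target_partition_mb * 1024 * 1024
    # Ceiling division, with a floor of one partition.
    return max(1, -(-input_bytes // target_bytes))

# Example: a 10 GiB shuffle suggests 80 partitions of ~128 MiB each.
print(suggest_shuffle_partitions(10 * 1024**3))
```

The result would be applied via `spark.conf.set("spark.sql.shuffle.partitions", n)`; on Spark 3.x, Adaptive Query Execution (`spark.sql.adaptive.enabled`) can coalesce shuffle partitions automatically instead.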

📌 Explain the difference between cache() and persist() in Spark. When would you use one over the other in a real-world scenario?
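A minimal sketch of the difference, assuming a running SparkSession (on DataFrames, `cache()` is shorthand for `persist()` at the default storage level, while `persist()` lets you pick the level explicitly):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[2]").appName("cache-demo").getOrCreate()
df = spark.range(1_000_000)

# cache(): default storage level (MEMORY_AND_DISK for DataFrames).
# Fine when the data fits comfortably in executor memory.
df.cache()

# persist(): choose the level explicitly, e.g. DISK_ONLY for a large
# intermediate result reused across several actions but too big for memory.
doubled = df.selectExpr("id * 2 AS doubled")
doubled.persist(StorageLevel.DISK_ONLY)

doubled.count()      # the first action materializes the persisted data
doubled.unpersist()  # release storage once downstream work is done
```

In practice: `cache()` for small, hot lookup data; `persist()` with a disk-backed level when re-computation is expensive but memory is tight.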

📌 You have a Spark job that performs multiple transformations on a large dataset. How can you minimize the number of stages in your job to improve performance?
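The key fact here is that stage boundaries are created by shuffles: narrow transformations (select, filter, withColumn) pipeline into a single stage, while each wide one (groupBy, join, distinct, repartition) adds an Exchange. A sketch under the assumption of a hypothetical DataFrame `df` with `user_id` and `amount` columns (`sum_distinct` needs PySpark 3.2+; older versions call it `sumDistinct`):

```python
from pyspark.sql import functions as F

# Two shuffles: distinct() and groupBy() each repartition the data.
bad = (df.select("user_id", "amount")
         .distinct()
         .groupBy("user_id").sum("amount"))

# One shuffle: fold the de-duplication into the same aggregation pass.
good = (df.select("user_id", "amount")
          .groupBy("user_id")
          .agg(F.sum_distinct("amount").alias("amount_sum")))
```

Counting `Exchange` operators in `df.explain()` output is a quick way to compare the two plans.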

📌 Given a Spark DataFrame, how would you optimize a groupBy operation that involves large datasets? Can you reduce the shuffle involved?
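A sketch of the usual levers, assuming a hypothetical `orders` DataFrame (the table and column names are illustrative):

```python
from pyspark.sql import functions as F

# 1. Project early: only the grouping key and aggregated columns cross the shuffle.
slim = orders.select("customer_id", "price")

# 2. Built-in aggregates run map-side partial aggregation (a HashAggregate
#    before the Exchange), so far less data is shuffled than with
#    collect_list + a Python UDF.
totals = slim.groupBy("customer_id").agg(
    F.sum("price").alias("revenue"),
    F.count("*").alias("n_orders"),
)

# 3. If the same key is grouped or joined repeatedly, repartition once by the
#    key and cache, so later wide operations can reuse the partitioning.
by_customer = slim.repartition("customer_id").cache()
```

The shuffle itself cannot be eliminated for a true global groupBy, but shrinking what crosses it is usually where the win is.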

📌 What is the impact of partitioning on Spark performance? How would you decide on the number of partitions for a given DataFrame?
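One way to frame the decision: enough partitions to keep every core busy, but not so many that task-scheduling overhead dominates, and not so few that individual partitions become huge. A hypothetical heuristic (the ~128 MB target and 2 tasks per core are illustrative defaults, not fixed Spark rules):

```python
def choose_num_partitions(data_bytes: int, total_cores: int,
                          target_partition_mb: int = 128,
                          tasks_per_core: int = 2) -> int:
    """Pick a partition count balancing partition size against parallelism."""
    size_based = max(1, -(-data_bytes // (target_partition_mb * 1024 * 1024)))
    parallelism_based = total_cores * tasks_per_core
    # Take whichever is larger: keep all cores busy, but never let
    # individual partitions grow unbounded.
    return max(size_based, parallelism_based)

# 1 GiB on a 16-core cluster: size suggests 8 partitions, parallelism wins with 32.
print(choose_num_partitions(1024**3, total_cores=16))
```

The chosen count is applied with `df.repartition(n)` (full shuffle) or `df.coalesce(n)` (narrow, can only reduce the count).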

📌 You are working with a skewed dataset in Spark, where one partition has significantly more data than others. How would you handle data skew to optimize performance?

📌 Explain how Spark's Catalyst Optimizer works. How does it optimize queries, and how can you tune the Catalyst Optimizer for better performance?
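For the data-skew question, the standard manual remedy is key salting: append a random suffix so one hot key spreads across several reducers, aggregate on the salted key, then aggregate a second time on the original key to merge the partials. A pure-Python sketch of the idea (hypothetical keys; in Spark the salted column would be built with `F.concat` and `F.rand`):

```python
import random

def salted_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Spread a single hot key across num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"

rng = random.Random(42)
# A skewed workload: one hot key dominates.
keys = ["hot"] * 1000 + ["cold"] * 10

buckets = {}
for k in keys:
    sk = salted_key(k, 8, rng)
    buckets[sk] = buckets.get(sk, 0) + 1

# The 1000 "hot" rows now spread over up to 8 salted keys instead of one
# partition; a second pass over the original key merges the partial counts.
hot = {k: v for k, v in buckets.items() if k.startswith("hot#")}
print(len(hot), sum(hot.values()))
```

On Spark 3.x, Adaptive Query Execution can also split skewed join partitions automatically via `spark.sql.adaptive.skewJoin.enabled`.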

📌 Imagine you're reading data from an external source (e.g., a Hive table) and performing multiple joins. What optimizations would you implement to speed up the process?
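When one side of a join is small, broadcasting it avoids shuffling the large side entirely. A sketch with hypothetical Hive table names, assuming a running SparkSession `spark`:

```python
from pyspark.sql.functions import broadcast

facts = spark.table("sales.transactions")   # large fact table
dims = spark.table("sales.stores")          # small dimension table

# Filter and prune columns BEFORE the join so less data is scanned and moved.
facts = facts.filter(facts.year == 2024).select("store_id", "amount")

# Broadcast the small side: every executor gets a full copy, so the large
# table is joined in place (BroadcastHashJoin) with no shuffle of facts.
joined = facts.join(broadcast(dims), on="store_id", how="left")

# Verify in the physical plan that the join shows BroadcastHashJoin
# rather than SortMergeJoin preceded by an Exchange.
joined.explain()
```

Spark also broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so tuning that threshold is an alternative to explicit hints.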

📌 You're asked to handle an ETL pipeline in Spark where you need to perform transformations on structured data. How would you leverage Spark SQL and DataFrame APIs for efficiency?
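Since the DataFrame API and Spark SQL compile to the same Catalyst plan, mixing them costs nothing; use whichever reads better for each step. A sketch of a small pipeline (paths, columns, and view names are all hypothetical):

```python
from pyspark.sql import functions as F

raw = spark.read.parquet("/data/raw/events")

cleaned = (raw
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts")))

# SQL is often clearer for the aggregation step.
cleaned.createOrReplaceTempView("events_clean")
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS n
    FROM events_clean
    GROUP BY event_date, event_type
""")

# Partitioned columnar output lets downstream jobs prune by date.
(daily.write.mode("overwrite")
      .partitionBy("event_date")
      .parquet("/data/curated/daily_counts"))
```

Sticking to built-in functions instead of Python UDFs keeps the whole pipeline inside Catalyst's optimizer and Tungsten's code generation.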

📌 You need to perform windowed operations on a large dataset (e.g., moving averages). How would you optimize Spark's execution for these types of operations to avoid expensive shuffles?
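The main trap with window functions is a window without `partitionBy`, which forces every row into a single partition. A sketch of a trailing moving average over a hypothetical `prices` DataFrame:

```python
from pyspark.sql import Window, functions as F

# partitionBy keeps each key's rows together on one executor (one shuffle,
# keyed by ticker); omitting it would collapse ALL rows into one partition.
w = (Window.partitionBy("ticker")
           .orderBy("trade_date")
           .rowsBetween(-6, 0))   # 7-row trailing window

with_ma = prices.withColumn("ma_7", F.avg("close").over(w))
```

If several window computations share the same partition key and ordering, defining one `Window` spec and reusing it lets Spark satisfy them with a single sort and shuffle instead of one per column.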