Data Engineer Questionnaire
Please complete the questions below as part of our application process:
* = required field
*Phone: (XXX) XXX-XXXX
*City, State, Zip:
*1. In a data warehouse, what is the difference between a Star and Snowflake schema?
*2. A work table named WorkContracts has the following columns: ContractId, Product, and Value. A dimension named DimContracts has a surrogate key field called ContractKey and the same 3 columns above. ContractId is the unique business key. Write the most efficient T-SQL statement (as you would if querying just a SQL Server database) that produces all rows in WorkContracts that are not in DimContracts.
*3. A table named DimAdvertisers has the following columns: AdvertiserKey, AdvertiserId, AdvertiserName, and City. AdvertiserKey is the primary key / unique clustered index for the table. There are no other indexes on the table. The following query is running slowly:
Select AdvertiserId, AdvertiserName
Where City = 'Denver'
Write an index that would improve this query performance.
*4. List 3 steps you would take to improve the performance of a poor performing query.
*5. Using the example given in question 2, and assuming “WorkContract” is defined as a DataFrame called “workDF” & “DimContracts” is defined as a DataFrame called “dimDF”, write the same “query” resulting in the same desired output using Scala and the Spark framework.
*6. You are tasked with processing a 100GB file nightly consisting of transactional data. The file consists of the following 'columns”: TransactionID, TransactionDateTime, Network, Advertiser, Subscriber. Processing as a single batch is proving problematic based on the resources of the machine performing the transformation process. Provide 3 different approaches you would take in order to complete processing in a reasonable time each night.
*7. Explain each of the following join types: inner, cross, full outer, left outer, right outer, left anti semi, left semi
*8.Using Scala, write a method to produce 2 lists of sequential numbers, the first being from 1 to 1000, the second being from 400 to 1200
*9. Take the 2 lists produced in question 8, and using just Scala, write an operation that will produce a single list where the 2 lists intersect
*10. Take the 2 lists produced in question 8, and using just Spark, convert the 2 lists to DataFrames and perform a join via Spark, producing a DataFrame with the same result as in question 9
*11. Name 3 benefits of using Docker
*12. Given a DataFrame consisting of: ClientName, TransactionYear, TransactionMonth, TransactionWeek, Revenue. Write 2 Spark “queries” producing the total revenue sum per client by year and by year and month. Use direct Spark SQL functions rather than relational spark.sql queries if possible
*13. Explain the difference between “val”, “var”, “def”, “lazy val”, “lazy var” in Scala
NOTE: If you haven't already done so, please send your resume to email@example.com.
BIAnalytix Rate Card™
1200 17th St. | Suite 770
Denver CO, 80202
Copyright © 2019 Decentrix Inc. All rights reserved.