Data Engineer Questionnaire

Please complete the questions below as part of our application process:

* = required field

*Phone: (XXX) XXX-XXXX

*Name:

*Email: name@email.com

*Address:

*City, State, Zip:

*County:

*1. In a data warehouse, what is the difference between a Star and Snowflake schema?

*2. A work table named WorkContracts has the following columns: ContractId, Product, and Value. A dimension named DimContracts has a surrogate key field called ContractKey and the same 3 columns above. ContractId is the unique business key. Write the most efficient T-SQL statement (as you would if querying just a SQL Server database) that produces all rows in WorkContracts that are not in DimContracts.

*3. A table named DimAdvertisers has the following columns: AdvertiserKey, AdvertiserId, AdvertiserName, and City. AdvertiserKey is the primary key / unique clustered index for the table. There are no other indexes on the table. The following query is running slowly:
Select AdvertiserId, AdvertiserName
From Dim.Advertisers
Where City = 'Denver'
Write an index that would improve this query performance.

*4. List 3 steps you would take to improve the performance of a poor performing query.

*5. Using the example given in question 2, and assuming “WorkContract” is defined as a DataFrame called “workDF” & “DimContracts” is defined as a DataFrame called “dimDF”, write the same “query” resulting in the same desired output using Scala and the Spark framework.

*6. You are tasked with processing a 100GB file nightly consisting of transactional data. The file consists of the following 'columns”: TransactionID, TransactionDateTime, Network, Advertiser, Subscriber. Processing as a single batch is proving problematic based on the resources of the machine performing the transformation process. Provide 3 different approaches you would take in order to complete processing in a reasonable time each night.

*7. Explain each of the following join types: inner, cross, full outer, left outer, right outer, left anti semi, left semi

*8.Using Scala, write a method to produce 2 lists of sequential numbers, the first being from 1 to 1000, the second being from 400 to 1200

*9. Take the 2 lists produced in question 8, and using just Scala, write an operation that will produce a single list where the 2 lists intersect

*10. Take the 2 lists produced in question 8, and using just Spark, convert the 2 lists to DataFrames and perform a join via Spark, producing a DataFrame with the same result as in question 9

*11. Name 3 benefits of using Docker

*12. Given a DataFrame consisting of: ClientName, TransactionYear, TransactionMonth, TransactionWeek, Revenue. Write 2 Spark “queries” producing the total revenue sum per client by year and by year and month. Use direct Spark SQL functions rather than relational spark.sql queries if possible

*13. Explain the difference between “val”, “var”, “def”, “lazy val”, “lazy var” in Scala

*14. Please include the salary range you are looking for.

*15. Are you a US Citizen, or do you have a Green Card?

NOTE: If you haven't already done so, please send your resume to jobapps@decentrix.net.