The developer must have sound knowledge of Apache Spark and Python programming.
Deep experience in developing data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
Experience in deploying and operationalizing the code is an added advantage, including knowledge and skills in DevOps, version control, and containerization. Hands-on deployment knowledge is preferred.
Create Spark jobs for data transformation and aggregation. Produce unit tests for Spark transformations and helper methods.
Write Scaladoc-style documentation with all code
Design data processing pipelines to perform batch and real-time/stream analytics on structured and unstructured data.
Spark query tuning and performance optimization, with a good understanding of the different file formats (ORC, Parquet, Avro) and compression techniques used to optimize queries and processing.