[Remote] PySpark & Delta Lake Developer

i4DM
United States
Remote
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

i4DM provides federal agencies with instant access to experienced and talented professionals. The company is seeking an experienced PySpark & Delta Lake Developer to design, build, and maintain scalable ETL pipelines that process and analyze large-scale healthcare claims data. Location: Remote (open to candidates in the USA). i4DM was founded in 2002 and is headquartered in Millersville, Maryland, USA, with a workforce of 51-200 employees. For more information, please visit https://www.i4dm.com.

Key Responsibilities

  • Design, develop, and maintain robust ETL pipelines using PySpark and Delta Lake for large and complex healthcare data workloads
  • Implement and optimize data lake solutions using Delta Lake table formats, supporting ACID transactions, schema enforcement, and time travel (see the sketch after this list)
  • Write efficient, reusable, and well-documented PySpark scripts for data ingestion, transformation, cleansing, and aggregation
  • Collaborate with data engineers, architects, and data scientists to understand business and data requirements and translate them into scalable data solutions
  • Ensure data quality, consistency, lineage, and integrity across all stages of data processing
  • Troubleshoot, debug, and optimize PySpark applications and Delta Lake workflows for cost, speed, and reliability within AWS
  • Maintain detailed and up-to-date technical documentation of code, data pipelines, and standard operating procedures
  • Stay current with the latest Delta Lake and Spark advancements and advocate for best practices in data management and analytics
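
For context, here is a minimal sketch of the kind of PySpark and Delta Lake pipeline described above. It is illustrative only: the S3 bucket, paths, and column names (claim_id, claim_amount, service_year) are hypothetical, and it assumes the delta-spark package is available on the cluster.

    from pyspark.sql import SparkSession, functions as F

    # Spark session with Delta Lake enabled (standard open-source Delta setup).
    spark = (
        SparkSession.builder.appName("claims-etl")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Ingest: raw claims land as CSV in S3 (hypothetical bucket and layout).
    raw = spark.read.option("header", True).csv("s3://example-bucket/claims/raw/")

    # Transform: cleanse and type the data.
    cleaned = (
        raw.dropDuplicates(["claim_id"])
           .withColumn("claim_amount", F.col("claim_amount").cast("double"))
           .filter(F.col("claim_amount") > 0)
    )

    # Load: append to a partitioned Delta table. The write is an ACID
    # transaction, and Delta enforces the existing table schema on append.
    (cleaned.write.format("delta")
            .mode("append")
            .partitionBy("service_year")
            .save("s3://example-bucket/claims/delta/"))

    # Time travel: read an earlier version of the table, e.g. for auditing.
    previous = (spark.read.format("delta")
                     .option("versionAsOf", 0)
                     .load("s3://example-bucket/claims/delta/"))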

Required Qualifications

  • Strong proficiency in Python and PySpark, with hands-on experience developing data pipelines
  • Advanced experience with Delta Lake and its ACID transaction and schema management features
  • Solid SQL skills for querying, joining, and optimizing data in distributed environments
  • Hands-on experience with AWS cloud data services (e.g., S3, Glue, EMR, Athena)
  • Familiarity with data lake concepts, partitioning, and performance tuning (illustrated in the sketch after this list)
  • Excellent communication skills and a willingness to continuously learn and adopt emerging technologies
  • Familiarity with CI/CD, version control (e.g., Git), and infrastructure as code
  • Experience with healthcare or claims data
  • Knowledge of data governance, security, data cataloging (AWS Glue Catalog), and compliance best practices
  • Strong ability to prioritize and execute tasks independently and within collaborative team environments
  • Previous experience working in a government or public sector setting
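
As a small illustration of the partitioning and performance-tuning work listed above, the sketch below compacts and queries the hypothetical Delta table from the previous example. It assumes open-source Delta Lake 2.x, where the optimize() API is available; the table path is again hypothetical.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Reuses a Delta-enabled session configured as in the earlier sketch.
    spark = SparkSession.builder.getOrCreate()

    table = DeltaTable.forPath(spark, "s3://example-bucket/claims/delta/")

    # Compact the small files produced by frequent appends.
    table.optimize().executeCompaction()

    # Drop data files no longer referenced by the table, keeping 7 days
    # (168 hours) of history so recent time-travel reads still work.
    table.vacuum(retentionHours=168)

    # Filtering on the partition column lets Spark prune whole directories.
    spark.sql("""
        SELECT service_year, COUNT(*) AS claim_count
        FROM delta.`s3://example-bucket/claims/delta/`
        WHERE service_year = '2023'
        GROUP BY service_year
    """).show()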

Required Skills

ETL Pipelines
AWS (S3, Glue, EMR, Athena)
Python
SQL
Troubleshooting and performance tuning
Data Quality and Governance
PySpark
Data Lake architecture
Delta Lake
CI/CD and version control