[Remote] PySpark & Delta Lake Developer

i4DM
United States
Remote
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

i4DM provides federal agencies with instant access to experienced and talented professionals. The company is seeking an experienced PySpark & Delta Lake Developer to design, build, and maintain scalable ETL pipelines that process and analyze large-scale healthcare claims data. Location: Remote (open to candidates in the USA). i4DM was founded in 2002 and is headquartered in Millersville, Maryland, USA, with a workforce of 51-200 employees. For more information, please visit https://www.i4dm.com.

Key Responsibilities

  • Design, develop, and maintain robust ETL pipelines using PySpark and Delta Lake for large and complex healthcare data workloads
  • Implement and optimize data lake solutions using Delta Lake table formats, supporting ACID transactions, schema enforcement, and time travel (see the sketch after this list)
  • Write efficient, reusable, and well-documented PySpark scripts for data ingestion, transformation, cleansing, and aggregation
  • Collaborate with data engineers, architects, and data scientists to understand business and data requirements and translate them into scalable data solutions
  • Ensure data quality, consistency, lineage, and integrity across all stages of data processing
  • Troubleshoot, debug, and optimize PySpark applications and Delta Lake workflows for cost, speed, and reliability within AWS
  • Maintain detailed and up-to-date technical documentation of code, data pipelines, and standard operating procedures
  • Stay current with the latest Delta Lake and Spark advancements and advocate for best practices in data management and analytics
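
For context, here is a minimal sketch of the kind of PySpark and Delta Lake pipeline described above. It is illustrative only: the S3 bucket, paths, and column names (claim_id, claim_amount, service_year) are hypothetical, and it assumes the delta-spark package is available on the cluster.

    from pyspark.sql import SparkSession, functions as F

    # Spark session with Delta Lake enabled (standard open-source Delta setup).
    spark = (
        SparkSession.builder.appName("claims-etl")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Ingest: raw claims land as CSV in S3 (hypothetical bucket and layout).
    raw = spark.read.option("header", True).csv("s3://example-bucket/claims/raw/")

    # Transform: cleanse and type the data.
    cleaned = (
        raw.dropDuplicates(["claim_id"])
           .withColumn("claim_amount", F.col("claim_amount").cast("double"))
           .filter(F.col("claim_amount") > 0)
    )

    # Load: append to a partitioned Delta table. The write is an ACID
    # transaction, and Delta enforces the existing table schema on append.
    (cleaned.write.format("delta")
            .mode("append")
            .partitionBy("service_year")
            .save("s3://example-bucket/claims/delta/"))

    # Time travel: read an earlier version of the table, e.g. for auditing.
    previous = (spark.read.format("delta")
                     .option("versionAsOf", 0)
                     .load("s3://example-bucket/claims/delta/"))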

Required Qualifications

  • Strong proficiency in Python and PySpark, with hands-on experience developing data pipelines
  • Advanced experience with Delta Lake and its ACID transaction and schema management features
  • Solid SQL skills for querying, joining, and optimizing data in distributed environments
  • Hands-on experience with AWS cloud data services (e.g., S3, Glue, EMR, Athena)
  • Familiarity with data lake concepts, partitioning, and performance tuning (illustrated in the sketch after this list)
  • Excellent communication skills and a willingness to continuously learn and adopt emerging technologies
  • Familiarity with CI/CD, version control (e.g., Git), and infrastructure as code
  • Experience with healthcare or claims data
  • Knowledge of data governance, security, data cataloging (AWS Glue Catalog), and compliance best practices
  • Strong ability to prioritize and execute tasks independently and within collaborative team environments
  • Previous experience working in a government or public sector setting
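
As a small illustration of the partitioning and performance-tuning work listed above, the sketch below compacts and queries the hypothetical Delta table from the previous example. It assumes open-source Delta Lake 2.x, where the optimize() API is available; the table path is again hypothetical.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # Reuses a Delta-enabled session configured as in the earlier sketch.
    spark = SparkSession.builder.getOrCreate()

    table = DeltaTable.forPath(spark, "s3://example-bucket/claims/delta/")

    # Compact the small files produced by frequent appends.
    table.optimize().executeCompaction()

    # Drop data files no longer referenced by the table, keeping 7 days
    # (168 hours) of history so recent time-travel reads still work.
    table.vacuum(retentionHours=168)

    # Filtering on the partition column lets Spark prune whole directories.
    spark.sql("""
        SELECT service_year, COUNT(*) AS claim_count
        FROM delta.`s3://example-bucket/claims/delta/`
        WHERE service_year = '2023'
        GROUP BY service_year
    """).show()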

Required Skills

ETL Pipelines
AWS (S3, Glue, EMR, Athena)
Python
SQL
Troubleshooting and performance tuning
Data Quality and Governance
PySpark
Data Lake architecture
Delta Lake
CI/CD and version control