Distributed data platforms • real-time ingestion • performance engineering

Hi there!
Welcome to my portfolio.

Product-minded AI Data Engineer

Seungryul Andrew Lee

Building production-grade data infrastructure, real-time analytics systems, and AI-enabled platforms across Azure, AWS, Databricks, Spark, and modern API services.

Databricks • Spark • Azure • AWS • FastAPI • ML workflow • Agentic AI

I’m someone who loves working with data and AI. While AI is transforming how we work with data, it still needs thoughtful engineering behind it: reliable infrastructure, clear direction, and strong validation. I enjoy building those foundations.

Years in data/AI/ML/platform engineering
Avg inference latency: ~4 ms
Streaming throughput: 200+ TPS
Simulated SaaS events: 400K+
Reliability: idempotency, retries, backfills
Performance: partitioning, tuning, cost
Trust: DQ gates, drift detection
Interfaces: clean tables + APIs
Systems I ship: typical end-to-end flow
APIs / Events → Ingestion + Contracts → Lakehouse / Warehouse → Transforms + DQ → Serving + Monitoring

Projects

Enterprise work + self-built systems

AI Data Cleaner Web Application

2026 • Full‑stack data tooling
FastAPI • Pandas • JavaScript • Vercel • Render
CSV + XLSX data ingestion
Automated data cleaning
Live web deployment

Built a full‑stack data cleaning web application that automatically detects and fixes common data quality issues in CSV and Excel datasets. The platform performs duplicate detection, date normalization, numeric standardization, and configurable missing‑value handling while providing before/after previews of the cleaned dataset.

  • Developed a FastAPI backend using Pandas to perform automated dataset profiling and cleaning.
  • Implemented configurable cleaning modes including strict removal, safe retention, and automatic value filling.
  • Built a responsive frontend interface supporting drag‑and‑drop uploads and interactive preview tables.
  • Deployed the backend on Render and the frontend on Vercel to provide a publicly accessible data cleaning tool.
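
The cleaning steps and modes listed above can be illustrated with a small, dependency-free sketch. This is not the application's code (the project itself uses Pandas). The function names, the mode names `strict`, `safe`, and `fill`, the accepted date formats, and the fill value are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical set of accepted input date formats; the real tool may support others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def normalize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_number(value):
    """Strip currency symbols and thousands separators; return a float or None."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_rows(rows, mode="safe", fill_value=0.0):
    """Clean rows of the form {"date": str, "amount": str}.

    mode="strict": drop rows with any missing or unparseable value
    mode="safe":   keep such rows, storing None for the bad value
    mode="fill":   replace missing amounts with fill_value
    """
    seen = set()
    cleaned = []
    for row in rows:
        record = (normalize_date(row["date"]), normalize_number(row["amount"]))
        if record in seen:  # exact duplicate after normalization
            continue
        seen.add(record)
        date, amount = record
        if mode == "strict" and (date is None or amount is None):
            continue
        if mode == "fill" and amount is None:
            amount = fill_value
        cleaned.append({"date": date, "amount": amount})
    return cleaned


raw = [
    {"date": "01/15/2026", "amount": "$1,200.50"},
    {"date": "2026-01-15", "amount": "1200.50"},  # duplicate once normalized
    {"date": "15 Jan 2026", "amount": "n/a"},
]
print(clean_rows(raw, mode="strict"))
print(clean_rows(raw, mode="fill"))
```

Deduplicating after normalization, rather than before, is a deliberate choice in this sketch: two rows that differ only in date or number formatting are treated as the same record.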

Enterprise Social ROI & Attribution Engine

Jan 2026 – Feb 2026 • Associated with QVC
Databricks • Delta Lake • PySpark • YouTube API • Key Vault
Serverless ELT
Idempotent merge / upsert
DQ gates pre-commit

Architected and deployed a serverless, end-to-end ELT framework on Databricks to quantify the financial impact of social media marketing. The system integrates real-time platform metrics with enterprise-scale clickstream data to enable granular ROI analysis for Influencer and Native content.

🏆 QVC YouTube Challenge Star Award — Recognized for delivering analytics infrastructure and automated reporting for YouTube campaign performance and marketing ROI insights.
  • High-volume ingestion: built automated connectors for the YouTube Reporting & Analytics APIs and merged with large-scale internal clickstream logs.
  • Delta Lake architecture: built robust PySpark pipelines with ACID transactions and idempotent merge logic for schema evolution and backfills.
  • Data quality: enforced schema checks and null constraints prior to commit.
  • Security + optimization: integrated OAuth 2.0 via Azure Key Vault and used serverless compute to optimize cost.
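
The idempotent merge and pre-commit checks described above can be illustrated in plain Python. This is a stand-in for a Delta Lake merge, not the pipeline's PySpark code. The key columns `video_id` and `date`, the `views` field, and both function names are hypothetical.

```python
REQUIRED_FIELDS = ("video_id", "date", "views")  # hypothetical schema


def passes_dq_gate(batch):
    """Pre-commit check: every row has all required fields and none is null."""
    return all(
        row.get(field) is not None for row in batch for field in REQUIRED_FIELDS
    )


def merge_upsert(table, batch):
    """Upsert a batch into `table`, keyed on (video_id, date).

    Matching keys are overwritten and new keys inserted, so replaying
    the same batch (a retry or a backfill) leaves the table unchanged.
    """
    if not passes_dq_gate(batch):
        raise ValueError("batch rejected by data quality gate; nothing committed")
    for row in batch:
        table[(row["video_id"], row["date"])] = dict(row)
    return table


table = {}
batch = [
    {"video_id": "a1", "date": "2026-01-10", "views": 120},
    {"video_id": "b2", "date": "2026-01-10", "views": 45},
]
merge_upsert(table, batch)
snapshot = dict(table)
merge_upsert(table, batch)  # replay: no duplicates, same state
print(table == snapshot, len(table))
```

The gate runs before any row is written, so a rejected batch leaves the table exactly as it was.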

Neural Tensiometry: Real-Time Surface Tension Prediction

Jan 2026 – Feb 2026 • Research replication
Python • TensorFlow • OpenCV • SciPy • Physics sim
<1 s per image
MAE 0.119
> 0.99

Built an end-to-end machine learning system that automatically measures surface tension from droplet images in under 1 second, about 1000× faster than conventional methods.

  • Developed a physics-based synthetic data generator using Young–Laplace equations.
  • Trained a 5-layer deep neural network on 10,000+ droplet shapes.
  • Implemented a complete computer vision pipeline for real-time image processing.
  • Replicated published research results from Kratz & Kierfeld.
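
A synthetic generator of this kind typically integrates the axisymmetric Young–Laplace shape equations outward from the drop apex. The sketch below makes several assumptions that do not come from the project or the paper: arc-length parametrization, lengths scaled by the apex radius of curvature, the pendant-drop sign convention dψ/ds = 2 − Bo·z − sin ψ / r, and a fixed-step RK4 integrator. The function name and the step settings are illustrative.

```python
import math


def drop_profile(bond, s_max=2.0, steps=2000):
    """Integrate the dimensionless Young-Laplace equations from the apex.

    State is (r, z, psi): radius, height above the apex, tangent angle.
    Returns a list of (r, z) points along the drop contour.
    """

    def rhs(r, z, psi):
        # At the apex sin(psi)/r is 0/0; its limit there is 1.
        curvature = 1.0 if r < 1e-12 else math.sin(psi) / r
        return math.cos(psi), math.sin(psi), 2.0 - bond * z - curvature

    h = s_max / steps
    r, z, psi = 0.0, 0.0, 0.0
    points = [(r, z)]
    for _ in range(steps):
        # Classic fourth-order Runge-Kutta step.
        k1 = rhs(r, z, psi)
        k2 = rhs(r + 0.5 * h * k1[0], z + 0.5 * h * k1[1], psi + 0.5 * h * k1[2])
        k3 = rhs(r + 0.5 * h * k2[0], z + 0.5 * h * k2[1], psi + 0.5 * h * k2[2])
        k4 = rhs(r + h * k3[0], z + h * k3[1], psi + h * k3[2])
        r += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        z += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        psi += h * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2]) / 6.0
        points.append((r, z))
    return points


# Sanity check: with zero Bond number the contour should follow a unit sphere.
sphere = drop_profile(bond=0.0, s_max=math.pi / 2)
print(sphere[-1])
```

Sweeping `bond` over a range of values would yield a family of contours, each labelled by the parameter that produced it, which is the general shape of a physics-based training set.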

Real-Time Fraud Detection System

Jan 2026 – Feb 2026 • Production ML pipeline
Kafka • Redis • XGBoost • FastAPI • Docker
4 ms avg inference
200+ TPS throughput
AUC 0.94

Designed and deployed a production-grade fraud detection system achieving ~4 ms average inference latency and an AUC of 0.94 at 200+ transactions per second.

  • Architected a streaming ML pipeline using Kafka for ingestion and Redis for low-latency feature caching.
  • Implemented real-time features capturing transaction velocity, geolocation anomalies, and temporal patterns.
  • Handled imbalance using SMOTE and cost-sensitive learning.
  • Built monitoring with a dashboard and containerized deployment using Docker.
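
The transaction-velocity feature mentioned above can be sketched as a sliding-window counter. This is a minimal illustration that assumes timestamps arrive in non-decreasing order per card. The class name, the 60-second window, and the in-memory dictionary (standing in for a cache such as Redis) are assumptions, not the deployed design.

```python
from collections import defaultdict, deque


class VelocityTracker:
    """Counts each card's transactions within a sliding time window.

    An in-memory stand-in for a low-latency feature cache.
    """

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card_id -> timestamps in window

    def record(self, card_id, timestamp):
        """Add a transaction and return the card's current velocity."""
        events = self._events[card_id]
        events.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events)


tracker = VelocityTracker(window_seconds=60)
velocities = [tracker.record("card-1", t) for t in (0, 10, 20, 65, 200)]
print(velocities)  # [1, 2, 3, 3, 1]
```

Each update touches only the head and tail of a deque, so the cost per transaction stays small, which matters when the feature sits on the inference path.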

AI Data Intelligence Agent

2026 • AI Analytics System
Python • FastAPI • Streamlit • DuckDB • OpenAI
AI Analytics Copilot
NL → SQL Query Engine
LLM SQL Generation
Interactive Analytics UI

Built an AI-powered analytics assistant that converts natural language questions into executable SQL queries and returns structured insights through an interactive dashboard.

  • Designed a full-stack architecture integrating Streamlit UI, FastAPI backend APIs, OpenAI LLMs, and DuckDB.
  • Implemented a natural-language-to-SQL pipeline that transforms business questions into executable database queries.
  • Built interactive dashboard components that visualize campaign metrics, revenue comparisons, and query results in real time.
  • Created an end-to-end AI analytics workflow bridging natural language questions, structured data queries, and automated insights.
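
The natural-language-to-SQL flow above can be sketched with a stubbed model and a read-only guard in front of the database. Everything here is an assumption for illustration: `fake_llm_to_sql` replaces the real LLM call, the standard-library `sqlite3` module stands in for DuckDB, and the `sales` table and guard rule are hypothetical.

```python
import sqlite3


def fake_llm_to_sql(question):
    """Stand-in for the LLM call; a real system would prompt a model here."""
    canned = {
        "total revenue by campaign": (
            "SELECT campaign, SUM(revenue) AS total FROM sales "
            "GROUP BY campaign ORDER BY campaign"
        ),
    }
    return canned[question.lower()]


def is_read_only(sql):
    """Accept a single SELECT statement; reject anything else."""
    statement = sql.strip().rstrip(";")
    return statement.lower().startswith("select") and ";" not in statement


def answer(conn, question):
    """Translate a question to SQL, check it, then run it."""
    sql = fake_llm_to_sql(question)
    if not is_read_only(sql):
        raise ValueError(f"refusing to run non-SELECT SQL: {sql!r}")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (campaign TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("spring", 100.0), ("spring", 50.0), ("summer", 75.0)],
)
rows = answer(conn, "Total revenue by campaign")
print(rows)  # [('spring', 150.0), ('summer', 75.0)]
```

The guard is intentionally conservative: it also rejects valid read-only queries such as those starting with `WITH`, trading coverage for a simple rule that generated SQL cannot modify data.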

Usage-Based SaaS CLV & Churn Prediction Platform

Dec 2025 – Feb 2026 • End-to-end ML platform
PostgreSQL • SQLAlchemy • XGBoost • Docker • Feature eng
400K+ events
300K+ observations
Leakage-safe labels

Designed and built an end-to-end machine learning platform to predict customer churn and forecast 90-day revenue for a simulated usage-based SaaS product.

  • Generated 400K+ events across 2,000 accounts and ingested them into a Dockerized PostgreSQL warehouse.
  • Built behavioral features across usage intensity, latency, errors, payment failures, and support activity.
  • Created forward-looking labels designed to avoid data leakage.
  • Trained XGBoost models under heavy class imbalance.
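
One common way to build forward-looking labels without leakage is to split each account's history at a cutoff date. The sketch below assumes that approach. The function name, the feature names, and the definition of churn as "no events in the 90 days after the cutoff" are illustrative assumptions rather than the platform's actual definitions.

```python
from datetime import date, timedelta


def build_example(events, cutoff, horizon_days=90):
    """Build one (features, label) pair for an account at a cutoff date.

    events: list of (event_date, amount) tuples for one account.
    Features use only events strictly before the cutoff; the label uses
    only events in [cutoff, cutoff + horizon), so nothing from the label
    window can leak into the features.
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    past = [amount for day, amount in events if day < cutoff]
    future = [amount for day, amount in events if cutoff <= day < horizon_end]
    features = {"n_events": len(past), "total_spend": sum(past)}
    label = {"churned": len(future) == 0, "revenue_90d": sum(future)}
    return features, label


events = [
    (date(2025, 12, 1), 20.0),
    (date(2025, 12, 20), 35.0),
    (date(2026, 1, 15), 40.0),  # falls in the label window
    (date(2026, 6, 1), 99.0),   # beyond the horizon, ignored
]
features, label = build_example(events, cutoff=date(2026, 1, 1))
print(features, label)
```

Keeping the two windows disjoint by construction means a feature cannot accidentally summarize the period the model is asked to predict.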

Axtria DataMAx

Jul 2023 – Jan 2025 • Axtria
ADF • Databricks • Snowflake • Redshift • DQ
13+ domains
Spark tuning
APIs FastAPI

Worked as a Data Engineer on DataMAx, a cloud data warehouse and integration product supporting ingestion, profiling, cataloging, quality, provisioning, and governance for large-scale DWBI needs.

  • Designed warehouse models across pharmaceutical domains.
  • Built and optimized pipelines using Azure Data Factory, Databricks, and Python.
  • Worked across Snowflake and Redshift for schema design and performance tuning.
  • Developed validation and monitoring frameworks plus production-grade data services via FastAPI.

Autonomous Perception Data Platform & Model Ops

May 2022 – Mar 2023 • Humanf
Python • C++ • Dataset ops • Training • Monitoring
Camera + LiDAR sync
Repro pipelines
SLA fallbacks

Engineered the end-to-end data path for an autonomous perception stack, including synchronized camera and LiDAR ingestion, dataset generation, and model packaging for reliable real-time inference.

  • Built ingestion services for multimodal sensor data and aligned timestamps for training and inference.
  • Created reproducible dataset pipelines with metadata, versioning, and standard splits.
  • Packaged models for low-latency inference with health checks and fallback logic.
  • Established validation gates and drift monitoring.
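
Timestamp alignment between camera frames and LiDAR sweeps can be sketched as a nearest-neighbour match with a tolerance. This is an illustrative approach, not the stack's implementation. The function name, the 20 ms tolerance, and the sample timestamps are assumptions.

```python
from bisect import bisect_left


def align_frames(camera_ts, lidar_ts, tolerance=0.02):
    """Pair each camera timestamp with the nearest LiDAR timestamp.

    Both inputs are sorted lists of seconds. Pairs further apart than
    `tolerance` are dropped rather than matched to a stale sweep.
    """
    pairs = []
    for cam in camera_ts:
        i = bisect_left(lidar_ts, cam)
        # The nearest neighbour is either just before or just after position i.
        candidates = lidar_ts[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda t: abs(t - cam))
        if abs(nearest - cam) <= tolerance:
            pairs.append((cam, nearest))
    return pairs


camera = [0.000, 0.033, 0.066, 0.100]
lidar = [0.001, 0.101]
print(align_frames(camera, lidar))  # [(0.0, 0.001), (0.1, 0.101)]
```

Dropping unmatched frames instead of pairing them with the closest available sweep keeps misaligned samples out of both training data and inference inputs.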

Project Source Code

Curated GitHub folders for the systems shown in this portfolio

Experience

Production delivery across data platforms, APIs, and distributed systems

AI Data Engineer
QVC Group • 2025 – Present • New York, NY
  • Architected real-time personalization pipelines using Azure Data Factory, Databricks (PySpark), and Synapse.
  • Designed a Delta Lake–based lakehouse architecture and optimized Spark partitioning and storage layouts to reduce execution cost.
  • Built production FastAPI services exposing curated metrics and model outputs.
  • Owned YouTube campaign reporting models and dashboards to measure performance, engagement, and marketing ROI.
  • Contributed to Meta campaign analytics and cross-channel reporting.
  • Implemented statistical validation and monitoring to detect drift and schema inconsistencies at scale.
Full Stack Engineer
Axtria Inc. • 2024 – 2025
  • Developed and versioned RESTful data services using FastAPI and Azure App Service.
  • Optimized large-scale Spark workloads via partitioning analysis, indexing improvements, and query tuning.
Data Engineer
Uber • 2023 – 2024
  • Architected distributed data pipelines powering Marketplace and Payments decision systems.
  • Built JVM-based (Scala) microservices serving consistent, high-quality features with strict SLAs.
  • Developed reporting pipelines and dashboards to surface marketplace and payments insights.
AI Data Engineer / Research Engineer
Humanf • 2022 – 2023
  • Built multimodal ingestion and synchronized camera/LiDAR streams for training and real-time inference.
  • Created reproducible dataset pipelines with versioning, validation gates, and monitoring for drift and schema changes.

Skills

Data engineering and platform strengths

Distributed Data Architecture
Lakehouse architecture, distributed data modeling, high-volume event ingestion, real-time and batch processing systems, API-driven data products
Cloud & Compute Infrastructure
Azure (ADF, Synapse, Fabric, ADLS, Azure ML), AWS (S3, Redshift), Databricks (PySpark), partitioning strategy and performance tuning
Systems & Optimization
Algorithm design, cost-performance tradeoff analysis, statistical validation frameworks, platform observability
Engineering & DevOps
FastAPI, REST APIs, CI/CD, monitoring, workflow orchestration, automated testing and validation

Certifications

Cloud, data engineering, and AI platform credentials

Data Warehousing with Databricks
Microsoft Fabric Analytics Engineer Associate (DP-600)
Azure AI Engineer Associate (AI-102)
Advanced Machine Learning Operations (Databricks)
Databricks Fundamentals (Academy Accreditation)

Education

New Jersey Institute of Technology

B.S. Applied Mathematics
New Jersey Institute of Technology
B.S. Computer Engineering
New Jersey Institute of Technology

Contact

Open to data platform, infrastructure, and AI engineering roles

Location New York, NY
Phone 862-682-1497