Distributed data platforms • real-time ingestion • performance engineering

Hi there!
Welcome to my portfolio.

I love working with data and AI. AI is transforming how we work with data, but it still needs thoughtful engineering behind it: reliable infrastructure, clear direction, and strong validation. I enjoy building those foundations. That’s who I am.

Years in data/AI/ML/platform engineering

Avg inference latency: ~4 ms

Streaming throughput: 200+ TPS

Simulated SaaS events: 400K+

Reliability

idempotency, retries, backfills

Performance

partitioning, tuning, cost

Trust

DQ gates, drift detection

Interfaces

clean tables + APIs

Systems I ship: typical end-to-end flow

APIs / Events → Ingestion + Contracts → Lakehouse / Warehouse → Transforms + DQ → Serving + Monitoring

Projects

Enterprise work + self-built systems

AI Data Cleaner Web Application

2026 • Full‑stack data tooling

FastAPI Pandas JavaScript Vercel Render

CSV + XLSX data ingestion

Automated data cleaning

Live web deployment

Built a full‑stack data cleaning web application that automatically detects and fixes common data quality issues in CSV and Excel datasets. The platform performs duplicate detection, date normalization, numeric standardization, and configurable missing‑value handling while providing before/after previews of the cleaned dataset.

Developed a FastAPI backend using Pandas to perform automated dataset profiling and cleaning.
Implemented configurable cleaning modes including strict removal, safe retention, and automatic value filling.
Built a responsive frontend interface supporting drag‑and‑drop uploads and interactive preview tables.
Deployed the backend on Render and the frontend on Vercel to provide a publicly accessible data cleaning tool.
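
The configurable cleaning modes can be illustrated with a minimal sketch. It uses only the Python standard library rather than the app's Pandas code, and the mode names (`strict`, `safe`, `fill`), the function `clean_rows`, and the default fill value are assumptions made for this example.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def clean_rows(rows: list[Row], mode: str = "safe", fill_value: str = "unknown") -> list[Row]:
    """Drop exact duplicates, then handle missing values according to `mode`.

    Hypothetical modes (names assumed for this sketch):
      strict: remove any row that has a missing value
      safe:   keep rows and leave missing values as None
      fill:   replace missing values with `fill_value`
    """
    if mode not in {"strict", "safe", "fill"}:
        raise ValueError(f"unknown mode: {mode}")

    seen: set[tuple] = set()
    cleaned: list[Row] = []
    for row in rows:
        # Treat empty or whitespace-only strings as missing.
        normalized = {
            k: (v.strip() or None) if isinstance(v, str) else v
            for k, v in row.items()
        }
        key = tuple(sorted(normalized.items(), key=lambda kv: kv[0]))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        has_missing = any(v is None for v in normalized.values())
        if mode == "strict" and has_missing:
            continue
        if mode == "fill":
            normalized = {k: fill_value if v is None else v for k, v in normalized.items()}
        cleaned.append(normalized)
    return cleaned
```

Running the same input through each mode and comparing row counts gives a simple before/after preview.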

Enterprise Social ROI & Attribution Engine

Jan 2026 – Feb 2026 • Associated with QVC

Databricks Delta Lake PySpark YouTube API Key Vault

Serverless ELT

Idempotent merge / upsert

DQ gates pre-commit

Architected and deployed a serverless, end-to-end ELT framework on Databricks to quantify the financial impact of social media marketing. The system integrates real-time platform metrics with enterprise-scale clickstream data to enable granular ROI analysis for Influencer and Native content.

🏆 QVC YouTube Challenge Star Award — Recognized for delivering analytics infrastructure and automated reporting for YouTube campaign performance and marketing ROI insights.

High-volume ingestion: built automated connectors for the YouTube Reporting & Analytics APIs and merged with large-scale internal clickstream logs.
Delta Lake architecture: built robust PySpark pipelines with ACID transactions and idempotent merge logic for schema evolution and backfills.
Data quality: enforced schema checks and null constraints prior to commit.
Security + optimization: integrated OAuth 2.0 via Azure Key Vault and used serverless compute to optimize cost.
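
The idempotent merge and pre-commit quality gate can be sketched in plain Python. This is an illustration of the idea, not the project's PySpark or Delta Lake code; `merge_upsert`, `check_batch`, `DataQualityError`, and the field names in the usage are hypothetical.

```python
from typing import Any

Record = dict[str, Any]


class DataQualityError(ValueError):
    """Raised when a batch fails the pre-commit checks."""


def check_batch(batch: list[Record], required: tuple[str, ...]) -> None:
    """Schema and null gate: every record must carry every required field, non-null."""
    for i, rec in enumerate(batch):
        missing = [c for c in required if rec.get(c) is None]
        if missing:
            raise DataQualityError(f"record {i} is missing {missing}")


def merge_upsert(
    table: dict[str, Record],
    batch: list[Record],
    key: str,
    required: tuple[str, ...],
) -> dict[str, Record]:
    """Return a new table state with `batch` merged in by `key`.

    The gate runs first, so a bad batch leaves the table untouched. Matching
    keys are overwritten and new keys are inserted, so replaying the same
    batch (a retry or a backfill) produces the same state.
    """
    check_batch(batch, (key, *required))
    merged = dict(table)
    for rec in batch:
        merged[rec[key]] = dict(rec)
    return merged
```

Because the function returns a new state instead of mutating the old one, a rejected batch cannot leave a partial write behind, which is the same guarantee a transactional commit provides.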

Neural Tensiometry: Real-Time Surface Tension Prediction

Jan 2026 – Feb 2026 • Research replication

Python TensorFlow OpenCV SciPy Physics sim

<1s per image

MAE 0.119

R² > 0.99

Built an end-to-end machine learning system that automatically measures surface tension from droplet images in under 1 second — 1000× faster than conventional methods.

Developed a physics-based synthetic data generator using Young–Laplace equations.
Trained a 5-layer deep neural network on 10,000+ droplet shapes.
Implemented a complete computer vision pipeline for real-time image processing.
Replicated published research results from Kratz & Kierfeld.
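
A physics-based shape generator of this kind can be sketched by integrating one common dimensionless form of the axisymmetric Young–Laplace equations from the drop apex. The function name `drop_profile`, the parameter `bond` (a Bond-number-like control parameter), and the sign convention are assumptions for this example; the project's generator may differ.

```python
import math


def drop_profile(bond: float, s_max: float, n_steps: int = 2000) -> list[tuple[float, float]]:
    """Integrate the dimensionless axisymmetric Young-Laplace equations with RK4.

    State is (r, z, phi) along arc length s, starting at the drop apex:
        dr/ds   = cos(phi)
        dz/ds   = sin(phi)
        dphi/ds = 2 - bond * z - sin(phi) / r
    Lengths are scaled by the apex radius of curvature. The sign in front of
    `bond` depends on the pendant/sessile convention; this sketch picks one.
    """

    def deriv(state):
        r, z, phi = state
        # sin(phi)/r tends to 1 at the apex, where both vanish.
        curvature = math.sin(phi) / r if r > 1e-12 else 1.0
        return (math.cos(phi), math.sin(phi), 2.0 - bond * z - curvature)

    def shifted(state, k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    h = s_max / n_steps
    state = (0.0, 0.0, 0.0)
    profile = [(0.0, 0.0)]
    for _ in range(n_steps):
        k1 = deriv(state)
        k2 = deriv(shifted(state, k1, h / 2))
        k3 = deriv(shifted(state, k2, h / 2))
        k4 = deriv(shifted(state, k3, h))
        state = tuple(
            s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        profile.append((state[0], state[1]))
    return profile
```

With `bond = 0` gravity drops out and the profile should trace a unit circle, which makes a convenient sanity check before sweeping `bond` to generate training shapes.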

Real-Time Fraud Detection System

Jan 2026 – Feb 2026 • Production ML pipeline

Kafka Redis XGBoost FastAPI Docker

4ms avg inference

200+ TPS throughput

AUC 0.94

Designed and deployed a production-grade fraud detection system that scores 200+ transactions per second at ~4ms average inference latency with an AUC of 0.94.

Architected a streaming ML pipeline using Kafka for ingestion and Redis for low-latency feature caching.
Implemented real-time features capturing transaction velocity, geolocation anomalies, and temporal patterns.
Handled class imbalance using SMOTE and cost-sensitive learning.
Built monitoring with a dashboard and containerized deployment using Docker.
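
A transaction-velocity feature can be sketched as a sliding-window counter. This in-memory version stands in for the Redis-backed cache; the class name `VelocityTracker`, the 60-second window, and the assumption that each card's timestamps arrive in non-decreasing order are all illustrative.

```python
from collections import defaultdict, deque


class VelocityTracker:
    """Counts each card's transactions inside a trailing time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card_id -> timestamps, oldest first

    def observe(self, card_id: str, ts: float) -> int:
        """Record a transaction and return the count in (ts - window, ts]."""
        events = self._events[card_id]
        events.append(ts)
        # Evict timestamps that have fallen out of the window.
        while events and events[0] <= ts - self.window_seconds:
            events.popleft()
        return len(events)
```

Each call does amortized constant work, which is the property that matters when the feature sits on the scoring path.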

AI Data Intelligence Agent

2026 • AI Analytics System

Python FastAPI Streamlit DuckDB OpenAI

AI Analytics Copilot

NL → SQL Query Engine

LLM SQL Generation

Interactive Analytics UI

Built an AI-powered analytics assistant that converts natural language questions into executable SQL queries and returns structured insights through an interactive dashboard.

Designed a full-stack architecture integrating Streamlit UI, FastAPI backend APIs, OpenAI LLMs, and DuckDB.
Implemented a natural-language-to-SQL pipeline that transforms business questions into executable database queries.
Built interactive dashboard components that visualize campaign metrics, revenue comparisons, and query results in real time.
Created an end-to-end AI analytics workflow bridging natural language questions, structured data queries, and automated insights.
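
The natural-language-to-SQL flow can be sketched end to end with a stubbed model call. Everything here is an assumption for illustration: `fake_llm_to_sql` replaces the OpenAI call with canned SQL, `sqlite3` stands in for DuckDB so the example runs with the standard library, and the read-only guard is one possible safeguard, not something the project is known to include. The keyword check is deliberately crude and would also reject harmless queries that mention those words.

```python
import re
import sqlite3

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)


def is_safe_select(sql: str) -> bool:
    """Accept a single read-only SELECT (or WITH ... SELECT) statement."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return False  # more than one statement
    if not stripped.lower().startswith(("select", "with")):
        return False
    return FORBIDDEN.search(stripped) is None


def fake_llm_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM call; returns canned SQL."""
    return (
        "SELECT channel, SUM(revenue) AS revenue "
        "FROM campaigns GROUP BY channel ORDER BY channel;"
    )


def answer(question: str, conn: sqlite3.Connection) -> list[tuple]:
    """Generate SQL for the question, validate it, and run it."""
    sql = fake_llm_to_sql(question)
    if not is_safe_select(sql):
        raise ValueError("generated SQL rejected by guard")
    return conn.execute(sql.strip().rstrip(";")).fetchall()
```

Keeping generation, validation, and execution as separate functions makes it easy to swap the stub for a real model call without touching the guard.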

Usage-Based SaaS CLV & Churn Prediction Platform

Dec 2025 – Feb 2026 • End-to-end ML platform

PostgreSQL SQLAlchemy XGBoost Docker Feature eng

400K+ events

300K+ observations

Leakage-safe labels

Designed and built an end-to-end machine learning platform to predict customer churn and forecast 90-day revenue for a simulated usage-based SaaS product.

Generated 400K+ events across 2,000 accounts and ingested them into a Dockerized PostgreSQL warehouse.
Built behavioral features across usage intensity, latency, errors, payment failures, and support activity.
Created forward-looking labels designed to avoid data leakage.
Trained XGBoost models under heavy class imbalance.
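
Leakage-safe labeling comes down to a strict time split: features may only see events up to a cutoff, and the label may only see events after it. A minimal sketch follows; the function `build_example`, the event kinds, and the churn definition (no usage event within the horizon) are assumptions for this example.

```python
from datetime import date, timedelta


def build_example(
    events: list[tuple[date, str]], cutoff: date, horizon_days: int = 90
) -> dict:
    """Split one account's events at `cutoff`.

    Features see only events on or before the cutoff; the churn label sees
    only events strictly after it, within the horizon.
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    past = [kind for day, kind in events if day <= cutoff]
    future = [kind for day, kind in events if cutoff < day <= horizon_end]
    return {
        "usage_count": past.count("usage"),
        "payment_failures": past.count("payment_failed"),
        # Assumed definition: churned means no usage during the horizon.
        "churned": int("usage" not in future),
    }
```

Sliding the cutoff across several dates per account yields multiple observations from the same event log while keeping every feature strictly earlier than its label.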

Axtria DataMAx

Jul 2023 – Jan 2025 • Axtria

ADF Databricks Snowflake Redshift DQ

13+ domains

Spark tuning

APIs FastAPI

Worked as a Data Engineer on DataMAx, a cloud data warehouse and integration product supporting ingestion, profiling, cataloging, quality, provisioning, and governance for large-scale DWBI needs.

Designed warehouse models across pharmaceutical domains.
Built and optimized pipelines using Azure Data Factory, Databricks, and Python.
Worked across Snowflake and Redshift for schema design and performance tuning.
Developed validation and monitoring frameworks plus production-grade data services via FastAPI.

Autonomous Perception Data Platform & Model Ops

May 2022 – Mar 2023 • Humanf

Python C++ Dataset ops Training Monitoring

Camera + LiDAR sync

Repro pipelines

SLA fallbacks

Engineered the end-to-end data path for an autonomous perception stack, including synchronized camera and LiDAR ingestion, dataset generation, and model packaging for reliable real-time inference.

Built ingestion services for multimodal sensor data and aligned timestamps for training and inference.
Created reproducible dataset pipelines with metadata, versioning, and standard splits.
Packaged models for low-latency inference with health checks and fallback logic.
Established validation gates and drift monitoring.
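
Timestamp alignment between two sensor streams can be sketched as nearest-neighbor matching with a tolerance. The function `align_nearest` and its parameters are hypothetical, and the sketch assumes both timestamp lists are already sorted and share a clock.

```python
from bisect import bisect_left


def align_nearest(
    camera_ts: list[float], lidar_ts: list[float], tolerance: float
) -> list[tuple[float, float]]:
    """Pair each camera timestamp with the nearest LiDAR timestamp.

    Both lists must be sorted ascending. Pairs further apart than `tolerance`
    (same units as the timestamps) are dropped.
    """
    pairs = []
    for t in camera_ts:
        i = bisect_left(lidar_ts, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = lidar_ts[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda x: abs(x - t))
        if abs(nearest - t) <= tolerance:
            pairs.append((t, nearest))
    return pairs
```

This version allows one LiDAR sweep to match more than one camera frame; a stricter pipeline would enforce one-to-one matching.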

Project Source Code

Curated GitHub folders for the systems shown in this portfolio

sl237-lee/sl237-lee.github.io

All source code lives inside a single portfolio repository under the projects/ folder

Browse project folders directly in GitHub.

age-detector

Computer vision project for age detection with an interactive app workflow.

Python

App

clv_churn_platform

Usage-based SaaS churn and customer lifetime value prediction platform.

Python

XGBoost

PostgreSQL

data_intelligence_agent_platform

Natural-language-to-SQL analytics assistant built with FastAPI, Streamlit, DuckDB, and LLM orchestration.

Python

FastAPI

DuckDB

data_model_work_example

Example data modeling and warehouse design work focused on clean analytical structures.

SQL

Modeling

Warehouse

fraud_detection_system

Real-time fraud detection pipeline using streaming features, ML scoring, and low-latency serving.

Python

Kafka

XGBoost

qvc_unified_data_foundation

Enterprise-style data foundation work for unified analytics, reporting, and scalable platform ingestion.

Data Platform

Databricks

Analytics

Each link opens the exact source folder for the project inside this portfolio repository.

Experience

Production delivery across data platforms, APIs, and distributed systems

AI Data Engineer

QVC Group • 2025 – Present • New York, NY

Architected real-time personalization pipelines using Azure Data Factory, Databricks (PySpark), and Synapse.
Designed a Delta Lake–based lakehouse architecture and optimized Spark partitioning and storage layouts to reduce execution cost.
Built production FastAPI services exposing curated metrics and model outputs.
Owned YouTube campaign reporting models and dashboards to measure performance, engagement, and marketing ROI.
Contributed to Meta campaign analytics and cross-channel reporting.
Implemented statistical validation and monitoring to detect drift and schema inconsistencies at scale.
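
One widely used statistic for this kind of drift monitoring is the Population Stability Index, sketched below. It is offered as an example of the technique, not as the method used at QVC; the function name, the equal-width binning, and the floor for empty bins are assumptions.

```python
import math


def population_stability_index(
    expected: list[float], actual: list[float], bins: int = 10
) -> float:
    """Compare two samples of one feature; larger values mean more drift.

    Bin edges come from the range of `expected`; values of `actual` outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples score exactly zero, so the alert threshold only needs to be tuned on the upper side.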

Full Stack Engineer

Axtria Inc. • 2024 – 2025

Developed and versioned RESTful data services using FastAPI and Azure App Service.
Optimized large-scale Spark workloads via partitioning analysis, indexing improvements, and query tuning.

Data Engineer

Uber • 2023 – 2024

Architected distributed data pipelines powering Marketplace and Payments decision systems.
Built JVM-based (Scala) microservices serving consistent, high-quality features with strict SLAs.
Developed reporting pipelines and dashboards to surface marketplace and payments insights.

AI Data Engineer / Research Engineer

Humanf • 2022 – 2023

Built multimodal ingestion and synchronized camera/LiDAR streams for training and real-time inference.
Created reproducible dataset pipelines with versioning, validation gates, and monitoring for drift and schema changes.

Skills

Data engineering and platform strengths

Distributed Data Architecture

Lakehouse architecture, distributed data modeling, high-volume event ingestion, real-time and batch processing systems, API-driven data products

Cloud & Compute Infrastructure

Azure (ADF, Synapse, Fabric, ADLS, Azure ML), AWS (S3, Redshift), Databricks (PySpark), partitioning strategy and performance tuning

Systems & Optimization

Algorithm design, cost-performance tradeoff analysis, statistical validation frameworks, platform observability

Engineering & DevOps

FastAPI, REST APIs, CI/CD, monitoring, workflow orchestration, automated testing and validation

Certifications

Cloud, data engineering, and AI platform credentials

Data Warehousing with Databricks

Microsoft Fabric Analytics Engineer Associate (DP-600)

Azure AI Engineer Associate (AI-102)

Advanced Machine Learning Operations (Databricks)

Databricks Fundamentals (Academy Accreditation)

Education

New Jersey Institute of Technology

B.S. Applied Mathematics

New Jersey Institute of Technology

B.S. Computer Engineering

New Jersey Institute of Technology

Contact

Open to data platform, infrastructure, and AI engineering roles

Location New York, NY

Email srlee02099@gmail.com

Phone 862-682-1497

Resume (PDF) GitHub LinkedIn

Seungryul Andrew Lee
