Hi, I’m Roman

Analytics & ML engineering lead with a builder mindset. I turn noisy operational, customer‑interaction, and event streams into governed, performant, ML‑ready datasets—then ship production models (classical + deep learning + LLM) with monitoring, lineage, and cost control baked in.

I focus on: reproducible Spark/Delta patterns, attribution & customer‑journey analytics, privacy/PII redaction at scale, streaming enrichment (Kafka → Delta), ML/LLM enablement (MLflow + Transformers), and KPI reliability for decision systems (BI at scale, telemetry feedback loops). I prefer high‑signal simplicity over accidental complexity.
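
To give the streaming-enrichment pattern some flavor, here is a minimal PySpark sketch of the Kafka → Delta path. Broker address, topic name, schema, and paths are hypothetical placeholders, not a production job.

```python
# Minimal sketch of the Kafka -> Delta enrichment path.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-delta-enrichment").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

enriched = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("ingested_at", F.current_timestamp())  # enrichment; joins/lookups slot in here
)

(
    enriched.writeStream.format("delta")
    .option("checkpointLocation", "/checkpoints/events")  # restart/replay tracking
    .outputMode("append")
    .start("/delta/events")
)
```

The checkpoint location plus Delta's transactional sink is what keeps restarts idempotent rather than duplicating events.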

Principles: contract‑driven schemas, incremental & testable transformations, early observability, and automation‑first developer ergonomics.
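
To make "contract-driven schemas" concrete, a minimal read-side sketch, assuming a hypothetical orders feed: schema inference is off, and FAILFAST aborts the job on drifting or malformed records instead of silently nulling them out.

```python
# Sketch of a contract-driven read: explicit schema + FAILFAST.
# The path and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

contract = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
])

orders = (
    spark.read
    .schema(contract)               # no inference: the contract is the source of truth
    .option("mode", "FAILFAST")     # abort on records that violate it
    .json("/landing/orders/")
)
```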


Selected Projects

| Project | One‑line Snapshot | Stack |
| --- | --- | --- |
| CargoIMP Spark Parser | High‑performance Cargo-IMP parsing → structured Delta tables | Scala, Spark |
| Fantastic Spork | Catalyst-native text & counting expressions (performance primitives) | Scala, Spark SQL |
| Sparqlin | Structured Spark SQL/Delta job framework (declarative YAML) | Spark, PySpark, YAML |
| RedactifyAI | AI PII detection & redaction replacing external DLP APIs | Python, Spark, Presidio |
| TableauUsageToDatabricks | BI telemetry ingestion & KPI reliability modeling | .NET, Parquet, Databricks |
| SafetyCultureToDatabricks | SaaS API → Delta ingestion automation | .NET, Delta Lake |

Explore all on the Projects page.


Platform & ML Stack

Processing & Storage: Spark (Scala/PySpark), Delta Lake, Parquet optimization, Kafka streaming, BigQuery, SQL Server, Ceph/S3 object storage.
ML / LLM: PyTorch, TensorFlow/Keras, MLflow (tracking & registry); Transformers & LLMs (BERT/DistilBERT/RoBERTa, sentence-transformers, Llama 3.x, Claude, GPT‑OSS, Gemini); Spark NLP; RAG pipelines (vector search + embeddings), PEFT (LoRA/QLoRA), weak supervision patterns, guardrails (policy / PII filters), drift & performance monitoring.
Data Modeling & Orchestration: Scala/Python pipeline development, dbt (analytics curation), declarative YAML job specs, modular parsing frameworks, feature pipelines, incremental strategies.
Governance & Privacy: Presidio redaction (sketched below), retention & masking frameworks, lineage & metadata capture, quality gates (schema + semantic checks).
Productivity & Observability: GitLab/GitHub CI/CD, Databricks, Google Colab, Streamlit (review & compliance UIs), Hugo (docs-as-code), logging & metric instrumentation, runbook automation.
Languages: Scala, Python, SQL, C#, Go (tooling), R.
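
For the Presidio redaction mentioned above, a minimal sketch of the detect-then-anonymize loop. The entity handling and replacement token are illustrative defaults; a production job would wrap this in a Spark UDF and tune the recognizers.

```python
# Hedged sketch of Presidio-based PII redaction.
# Replacement token and default operator are illustrative choices.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()      # NER + pattern recognizers
anonymizer = AnonymizerEngine()  # applies operators to detected spans

def redact(text: str) -> str:
    """Detect PII spans and replace each with a fixed token."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(
        text=text,
        analyzer_results=findings,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"})},
    ).text

print(redact("Contact Jane Doe at jane.doe@example.com"))
```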


Principles I Care About

  1. Determinism > Cleverness – Re-runs should reproduce the same tables (see the sketch after this list).
  2. Explicit Schemas – Fail early on drift; evolve intentionally.
  3. Small Modules – Easier to test, reason about, and optimize.
  4. Observability from Day 1 – Metrics & logs are features.
  5. Docs Close to Code – Generatable or embedded where possible.
  6. Governed Evolution – Lineage, quality gates, and retention aligned with policy from the start.
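
A minimal sketch of the determinism principle in practice: an idempotent Delta MERGE keyed on a stable business key, so re-running the same batch converges to the same table. It assumes a Delta-enabled Spark session; paths and the key column are hypothetical.

```python
# Idempotent upsert sketch: repeated runs of the same batch MERGE by key
# and reproduce the same table. Paths and key are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
batch = spark.read.parquet("/staging/events_batch/")  # the increment to apply

target = DeltaTable.forPath(spark, "/delta/events")
(
    target.alias("t")
    .merge(batch.alias("s"), "t.event_id = s.event_id")  # stable business key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```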

Potential Collaboration

I’m interested in challenges involving data platforms, applied ML/LLMs, and quantitative modeling, and I’m open to adjacent problems. If yours sits at that intersection, or somewhere nearby, reach out.

Publications & CV

Selected statistical research (Laplace distribution asymptotics, test power, robust alternatives, insurance risk modeling) is listed on the CV, along with peer‑reviewed article and monograph references.