Hi, I’m Roman
Analytics & ML engineering lead with a builder mindset. I turn noisy operational, customer‑interaction, and event streams into governed, performant, ML‑ready datasets—then ship production models (classical + deep learning + LLM) with monitoring, lineage, and cost control baked in.
I focus on: reproducible Spark/Delta patterns, attribution & customer‑journey analytics, privacy/PII redaction at scale, streaming enrichment (Kafka → Delta), ML/LLM enablement (MLflow + Transformers), and KPI reliability for decision systems (BI at scale, telemetry feedback loops). I prefer high‑signal simplicity over accidental complexity.
Principles: contract‑driven schemas, incremental & testable transformations, early observability, and automation‑first developer ergonomics.
Core Focus Areas
- Data & Platform Engineering: Spark / Delta Lake patterns (batch + incremental), scalable ingestion, performance & cost tuning.
- Streaming & Event Enrichment: Kafka → Delta ingestion, SLA & observability instrumentation, hybrid batch/real‑time joins (a minimal ingestion sketch follows this list).
- ML / LLM Delivery: Feature curation, MLflow lifecycle, PyTorch / TensorFlow/Keras; Transformers & LLMs (BERT/DistilBERT/RoBERTa, sentence-transformers, Llama 3.x, Mixtral/Mistral, GPT‑OSS); RAG (vector search + embeddings), PEFT (LoRA/QLoRA), evaluation & drift signals.
- Customer & Attribution Analytics: Multi‑touch attribution (Markov / MCMC), funnel & churn analytics, survival / longitudinal models, anomaly detection.
- Privacy & Governance: Retention, classification, PII detection & redaction (Spark + Presidio), lineage & policy alignment.
- Developer Productivity: Internal frameworks, job scaffolds, reference architectures, docs-as-code (Hugo), reproducible runbooks (Zeppelin/Notebooks).
- Semantic & BI Reliability: KPI governance, usage telemetry, dashboard performance & quality gates.
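
To give the streaming focus above a concrete shape, here is a minimal Kafka → Delta ingestion sketch in PySpark Structured Streaming. The broker address, topic name, event schema, and paths are illustrative placeholders, not a production configuration.

```python
# Minimal Kafka -> Delta ingestion sketch (PySpark Structured Streaming).
# Broker, topic, schema, and paths are placeholders, not real config.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-bronze-ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("ingested_at", F.current_timestamp())  # ingestion-time audit column
)

(
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/chk/events")  # checkpointing makes restarts idempotent
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("/delta/bronze/events")
)
```

Enforcing `event_schema` at parse time is what makes drift fail early instead of silently corrupting downstream tables.
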
Selected Projects
| Project | One‑line Snapshot | Stack |
|---|---|---|
| CargoIMP Spark Parser | High‑performance Cargo-IMP parsing → structured Delta tables | Scala, Spark |
| Fantastic Spork | Catalyst-native text & counting expressions (performance primitives) | Scala, Spark SQL |
| Sparqlin | Structured Spark SQL/Delta job framework (declarative YAML) | Spark, PySpark, YAML |
| RedactifyAI | AI PII detection & redaction replacing external DLP APIs (sketch below) | Python, Spark, Presidio |
| TableauUsageToDatabricks | BI telemetry ingestion & KPI reliability modeling | .NET, Parquet, Databricks |
| SafetyCultureToDatabricks | SaaS API → Delta ingestion automation | .NET, Delta Lake |
Explore all on the Projects page.
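
To make the RedactifyAI row concrete, here is an illustrative sketch of the general pattern, not the project's actual code: Presidio's analyzer and anonymizer wrapped in a pandas UDF so Spark can redact free‑text columns in bulk. The column name and default recognizers are assumptions.

```python
# Illustrative PII-redaction pattern (Presidio + Spark pandas UDF).
# Not the RedactifyAI implementation; columns and recognizer defaults are placeholders.
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

_ENGINES = {}

def _engines():
    # Build Presidio engines lazily on each executor instead of shipping them from the driver.
    if not _ENGINES:
        _ENGINES["analyzer"] = AnalyzerEngine()
        _ENGINES["anonymizer"] = AnonymizerEngine()
    return _ENGINES["analyzer"], _ENGINES["anonymizer"]

def _redact(text):
    if not text:
        return text
    analyzer, anonymizer = _engines()
    findings = analyzer.analyze(text=text, language="en")  # default recognizers: names, emails, phones, ...
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

@F.pandas_udf(StringType())
def redact_udf(texts: pd.Series) -> pd.Series:
    return texts.map(_redact)

# Usage: df.withColumn("notes_redacted", redact_udf("notes"))
```
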
Platform & ML Stack
Processing & Storage: Spark (Scala/PySpark), Delta Lake, Parquet optimization, Kafka streaming, BigQuery, SQL Server, Ceph/S3 object storage.
ML / LLM: PyTorch, TensorFlow/Keras, MLflow (tracking & registry; see the tracking sketch below); Transformers & LLMs (BERT/DistilBERT/RoBERTa, sentence-transformers, Llama 3.x, Claude, GPT‑OSS, Gemini); Spark NLP; RAG pipelines (vector search + embeddings), PEFT (LoRA/QLoRA), weak supervision patterns, guardrails (policy / PII filters), drift & performance monitoring.
Data Modeling & Orchestration: Scala/Python pipeline development, dbt (analytics curation), declarative YAML job specs, modular parsing frameworks, feature pipelines, incremental strategies.
Governance & Privacy: Presidio redaction, retention & masking frameworks, lineage & metadata capture, quality gates (schema + semantic checks).
Productivity & Observability: GitLab/GitHub CI/CD, Databricks, Google Colab, Streamlit (review & compliance UIs), Hugo (docs-as-code), logging & metric instrumentation, runbook automation.
Languages: Scala, Python, SQL, C#, Go (tooling), R.
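
For the MLflow piece of the stack, a minimal tracking sketch is below. The experiment name, metric, and model choice are placeholders; scikit-learn stands in for a heavier PyTorch/TensorFlow model purely to keep the example short.

```python
# Minimal MLflow tracking sketch; names and model choice are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        # registered_model_name="churn_baseline",  # uncomment when a registry-backed tracking server is available
    )
```
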
Impact Highlights
- ~70–80% cost reduction & ~30% latency improvement via in‑house PII redaction (RedactifyAI).
- Improved transit ETA accuracy by ~15% through feature engineering & model refinement (WTG platform).
- 25% reduction in processing errors with anomaly detection & streaming quality checks.
- Scaled BI ecosystem (~700 Tableau views / ~300 workbooks) with governance & automated validation.
- Delivered Responsible Sales & compliance analytics improving complaint handling and churn mitigation.
Principles I Care About
- Determinism > Cleverness – Re-runs should reproduce the same tables.
- Explicit Schemas – Fail early on drift; evolve intentionally.
- Small Modules – Easier to test, reason about, and optimize.
- Observability from Day 1 – Metrics & logs are features.
- Docs Close to Code – Generatable or embedded where possible.
- Governed Evolution – Lineage, quality gates, and retention aligned with policy from the start.
Potential Collaboration
I’m interested in challenges (and adjacent ones) involving:
- Structured / semi‑structured / unstructured data → governed Delta / SQL transformations & incremental data modeling.
- Streaming enrichment & hybrid batch/real‑time feature pipelines (CDC + event joins).
- Attribution, churn, funnel & journey analytics at scale (customer behavior modeling; see the attribution sketch after this list).
- Advanced mathematical & statistical modeling of business & operational processes (survival/longitudinal, stochastic simulation, Monte Carlo, SDE numerics), customer behavior & propensity, logistics & network performance, investment & portfolio / risk, insurance & finance analytics.
- Privacy overlays (classification, redaction, retention) and compliant data activation.
- ML/LLM enablement frameworks (feature store patterns, model lifecycle, RAG, PEFT, evaluation & guardrails).
- Developer ergonomics & internal platform/product evolution (frameworks, scaffolds, governance automation).
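
Of these, multi‑touch attribution is the easiest to show compactly, so here is a toy sketch of the first‑order Markov removal‑effect idea referenced above. The journeys and channel names are invented; a real pipeline would estimate transitions from event tables rather than a hard‑coded list.

```python
# Toy first-order Markov ("removal effect") attribution sketch.
# Journeys and channels are synthetic; real inputs would come from journey tables.
from collections import defaultdict

journeys = [
    (["search", "social", "email"], True),   # (ordered touchpoints, converted?)
    (["search", "email"], True),
    (["social"], False),
    (["email", "social"], False),
    (["search"], True),
]

def transition_probs(journeys):
    counts = defaultdict(lambda: defaultdict(int))
    for touches, converted in journeys:
        seq = ["start"] + touches + ["conv" if converted else "null"]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {s: {t: n / sum(nxt.values()) for t, n in nxt.items()} for s, nxt in counts.items()}

def p_conversion(trans, removed=None, iters=200):
    # Probability of reaching "conv" from "start"; a removed channel loses all of its traffic to "null".
    states = set(trans) | {t for nxt in trans.values() for t in nxt}
    p = {s: 0.0 for s in states}
    p["conv"] = 1.0
    for _ in range(iters):  # fixed-point iteration over the absorbing chain
        for s, nxt in trans.items():
            if s == removed:
                continue
            p[s] = sum(prob * (0.0 if t == removed else p[t]) for t, prob in nxt.items())
    return p["start"]

trans = transition_probs(journeys)
base = p_conversion(trans)
channels = sorted({c for touches, _ in journeys for c in touches})
removal_effect = {c: 1 - p_conversion(trans, removed=c) / base for c in channels}
total = sum(removal_effect.values())
credit = {c: round(e / total, 3) for c, e in removal_effect.items()}
print(credit)  # fractional conversion credit per channel
```

In practice the transition matrix would be rebuilt incrementally from Delta tables and the fixed‑point solve replaced by a linear solver, but the credit allocation logic is the same.
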
If your problem sits at the intersection of data platforms, applied ML/LLMs, and quantitative modeling—or something nearby—reach out.
Publications & CV
Selected statistical research (Laplace distribution asymptotics, test power, robust alternatives, insurance risk modeling) is listed on the CV, along with peer‑reviewed article and monograph references.
Links & Contact
- GitLab: https://gitlab.com/rokorolev
- LinkedIn: https://www.linkedin.com/in/roman-k-data-lead
- Email: roman_linkedin@yahoo.com
- CV: /cv/