Hi, I’m Roman
Analytics & ML engineering lead with a builder mindset. I turn noisy operational, customer‑interaction, and event streams into governed, performant, ML‑ready datasets—then ship production models (classical + deep learning + LLM) with monitoring, lineage, and cost control baked in.
I focus on: reproducible Spark/Delta patterns, attribution & customer‑journey analytics, privacy/PII redaction at scale, streaming enrichment (Kafka → Delta), ML/LLM enablement (MLflow + Transformers), and KPI reliability for decision systems (BI at scale, telemetry feedback loops). I prefer high‑signal simplicity over accidental complexity.
Principles: contract‑driven schemas, incremental & testable transformations, early observability, and automation‑first developer ergonomics.
Core Focus Areas
- Data & Platform Engineering: Spark / Delta Lake patterns (batch + incremental), scalable ingestion, performance & cost tuning.
- Streaming & Event Enrichment: Kafka → Delta ingestion, SLA & observability instrumentation, hybrid batch/real‑time joins (a minimal ingestion sketch follows this list).
- ML / LLM Delivery: Feature curation, MLflow lifecycle, PyTorch / TensorFlow/Keras; Transformers & LLMs (BERT/DistilBERT/RoBERTa, sentence-transformers, Llama 3.x, Mixtral/Mistral, GPT‑OSS); RAG (vector search + embeddings), PEFT (LoRA/QLoRA), evaluation & drift signals.
- Customer & Attribution Analytics: Multi‑touch attribution (Markov / MCMC), funnel & churn analytics, survival / longitudinal models, anomaly detection.
- Privacy & Governance: Retention, classification, PII detection & redaction (Spark + Presidio), lineage & policy alignment.
- Developer Productivity: Internal frameworks, job scaffolds, reference architectures, docs-as-code (Hugo), reproducible runbooks (Zeppelin/Notebooks).
- Semantic & BI Reliability: KPI governance, usage telemetry, dashboard performance & quality gates.
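
To give the streaming focus above a concrete shape, here is a minimal Kafka → Delta ingestion sketch in PySpark Structured Streaming. The broker address, topic name, event schema, and paths are illustrative placeholders, not a production configuration.

```python
# Minimal Kafka -> Delta ingestion sketch (PySpark Structured Streaming).
# Broker, topic, schema, and paths are placeholders, not real config.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-bronze-ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("ingested_at", F.current_timestamp())  # ingestion-time audit column
)

(
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/chk/events")  # checkpointing makes restarts idempotent
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("/delta/bronze/events")
)
```

Enforcing `event_schema` at parse time is what makes drift fail early instead of silently corrupting downstream tables.
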
Selected Projects
| Project | One‑line Snapshot | Stack |
|---|---|---|
| CargoIMP Spark Parser | High‑performance Cargo-IMP parsing → structured Delta tables | Scala, Spark |
| Fantastic Spork | Catalyst-native text & counting expressions (performance primitives) | Scala, Spark SQL |
| Sparqlin | Structured Spark SQL/Delta job framework (declarative YAML) | Spark, PySpark, YAML |
| RedactifyAI | AI PII detection & redaction replacing external DLP APIs (sketch below) | Python, Spark, Presidio |
| TableauUsageToDatabricks | BI telemetry ingestion & KPI reliability modeling | .NET, Parquet, Databricks |
| SafetyCultureToDatabricks | SaaS API → Delta ingestion automation | .NET, Delta Lake |
Explore all on the Projects page.
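
To make the RedactifyAI row concrete, here is an illustrative sketch of the general pattern, not the project's actual code: Presidio's analyzer and anonymizer wrapped in a pandas UDF so Spark can redact free‑text columns in bulk. The column name and default recognizers are assumptions.

```python
# Illustrative PII-redaction pattern (Presidio + Spark pandas UDF).
# Not the RedactifyAI implementation; columns and recognizer defaults are placeholders.
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

_ENGINES = {}

def _engines():
    # Build Presidio engines lazily on each executor instead of shipping them from the driver.
    if not _ENGINES:
        _ENGINES["analyzer"] = AnalyzerEngine()
        _ENGINES["anonymizer"] = AnonymizerEngine()
    return _ENGINES["analyzer"], _ENGINES["anonymizer"]

def _redact(text):
    if not text:
        return text
    analyzer, anonymizer = _engines()
    findings = analyzer.analyze(text=text, language="en")  # default recognizers: names, emails, phones, ...
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

@F.pandas_udf(StringType())
def redact_udf(texts: pd.Series) -> pd.Series:
    return texts.map(_redact)

# Usage: df.withColumn("notes_redacted", redact_udf("notes"))
```
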
Platform & ML Stack
Processing & Storage: Spark (Scala/PySpark), Delta Lake, Parquet optimization, Kafka streaming, BigQuery, SQL Server, Ceph/S3 object storage.
ML / LLM: PyTorch, TensorFlow/Keras, MLflow (tracking & registry; see the tracking sketch below); Transformers & LLMs (BERT/DistilBERT/RoBERTa, sentence-transformers, Llama 3.x, Claude, GPT‑OSS, Gemini); Spark NLP; RAG pipelines (vector search + embeddings), PEFT (LoRA/QLoRA), weak supervision patterns, guardrails (policy / PII filters), drift & performance monitoring.
Data Modeling & Orchestration: Scala/Python pipeline development, dbt (analytics curation), declarative YAML job specs, modular parsing frameworks, feature pipelines, incremental strategies.
Governance & Privacy: Presidio redaction, retention & masking frameworks, lineage & metadata capture, quality gates (schema + semantic checks).
Productivity & Observability: GitLab/GitHub CI/CD, Databricks, Google Colab, Streamlit (review & compliance UIs), Hugo (docs-as-code), logging & metric instrumentation, runbook automation.
Languages: Scala, Python, SQL, C#, Go (tooling), R.
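
For the MLflow piece of the stack, a minimal tracking sketch is below. The experiment name, metric, and model choice are placeholders; scikit-learn stands in for a heavier PyTorch/TensorFlow model purely to keep the example short.

```python
# Minimal MLflow tracking sketch; names and model choice are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        # registered_model_name="churn_baseline",  # uncomment when a registry-backed tracking server is available
    )
```
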
Impact Highlights
- ~70–80% cost reduction & ~30% latency improvement via in‑house PII redaction (RedactifyAI).
- Improved transit ETA accuracy by ~15% through feature engineering & model refinement (WTG platform).
- 25% reduction in processing errors with anomaly detection & streaming quality checks.
- Scaled BI ecosystem (~700 Tableau views / ~300 workbooks) with governance & automated validation.
- Delivered Responsible Sales & compliance analytics improving complaint handling and churn mitigation.
Principles I Care About
- Determinism > Cleverness – Re-runs should reproduce the same tables.
- Explicit Schemas – Fail early on drift; evolve intentionally.
- Small Modules – Easier to test, reason about, and optimize.
- Observability from Day 1 – Metrics & logs are features.
- Docs Close to Code – Generatable or embedded where possible.
- Governed Evolution – Lineage, quality gates, and retention aligned with policy from the start.
Potential Collaboration
I’m interested in challenges (and adjacent ones) involving:
- Structured / semi‑structured / unstructured data → governed Delta / SQL transformations & incremental data modeling.
- Streaming enrichment & hybrid batch/real‑time feature pipelines (CDC + event joins).
- Attribution, churn, funnel & journey analytics at scale (customer behavior modeling; see the attribution sketch after this list).
- Advanced mathematical & statistical modeling of business & operational processes (survival/longitudinal, stochastic simulation, Monte Carlo, SDE numerics), customer behavior & propensity, logistics & network performance, investment & portfolio / risk, insurance & finance analytics.
- Privacy overlays (classification, redaction, retention) and compliant data activation.
- ML/LLM enablement frameworks (feature store patterns, model lifecycle, RAG, PEFT, evaluation & guardrails).
- Developer ergonomics & internal platform/product evolution (frameworks, scaffolds, governance automation).
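
Of these, multi‑touch attribution is the easiest to show compactly, so here is a toy sketch of the first‑order Markov removal‑effect idea referenced above. The journeys and channel names are invented; a real pipeline would estimate transitions from event tables rather than a hard‑coded list.

```python
# Toy first-order Markov ("removal effect") attribution sketch.
# Journeys and channels are synthetic; real inputs would come from journey tables.
from collections import defaultdict

journeys = [
    (["search", "social", "email"], True),   # (ordered touchpoints, converted?)
    (["search", "email"], True),
    (["social"], False),
    (["email", "social"], False),
    (["search"], True),
]

def transition_probs(journeys):
    counts = defaultdict(lambda: defaultdict(int))
    for touches, converted in journeys:
        seq = ["start"] + touches + ["conv" if converted else "null"]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {s: {t: n / sum(nxt.values()) for t, n in nxt.items()} for s, nxt in counts.items()}

def p_conversion(trans, removed=None, iters=200):
    # Probability of reaching "conv" from "start"; a removed channel loses all of its traffic to "null".
    states = set(trans) | {t for nxt in trans.values() for t in nxt}
    p = {s: 0.0 for s in states}
    p["conv"] = 1.0
    for _ in range(iters):  # fixed-point iteration over the absorbing chain
        for s, nxt in trans.items():
            if s == removed:
                continue
            p[s] = sum(prob * (0.0 if t == removed else p[t]) for t, prob in nxt.items())
    return p["start"]

trans = transition_probs(journeys)
base = p_conversion(trans)
channels = sorted({c for touches, _ in journeys for c in touches})
removal_effect = {c: 1 - p_conversion(trans, removed=c) / base for c in channels}
total = sum(removal_effect.values())
credit = {c: round(e / total, 3) for c, e in removal_effect.items()}
print(credit)  # fractional conversion credit per channel
```

In practice the transition matrix would be rebuilt incrementally from Delta tables and the fixed‑point solve replaced by a linear solver, but the credit allocation logic is the same.
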
If your problem sits at the intersection of data platforms, applied ML/LLMs, and quantitative modeling—or something nearby—reach out.
Publications & CV
Selected statistical research (Laplace distribution asymptotics, test power, robust alternatives, insurance risk modeling) is listed on the CV, along with peer‑reviewed article and monograph references.
Links & Contact
- GitLab: https://gitlab.com/rokorolev
- LinkedIn: https://www.linkedin.com/in/roman-k-data-lead
- Email: roman_linkedin@yahoo.com
- CV: /cv/