CargoIMP Spark Parser

cargoimp-spark-parser is a Scala library that extends your Apache Spark jobs with native high-performance parsing for IATA Cargo-IMP (International Air Transport Association Cargo Interchange Message Procedures) messages. It empowers Spark users in air cargo, logistics, and data engineering domains by making Cargo-IMP message types—such as FHL and FWB—directly accessible, explorable, and analyzable within Spark DataFrames and SQL queries. Project Link: https://rokorolev.gitlab.io/cargoimp-spark-parser/
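As a hypothetical usage sketch (the data source short name "cargoimp", the input path, and the view name are assumptions for illustration, not taken from the project docs), parsed messages could be exposed to Spark SQL roughly like this:

```scala
// Hypothetical usage sketch: the data source name "cargoimp" and the
// input layout are assumptions, not taken from the project docs.
import org.apache.spark.sql.SparkSession

object CargoImpExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cargoimp-example")
      .master("local[*]")
      .getOrCreate()

    // Parse raw FWB/FHL message files into a DataFrame via the custom data source.
    val messages = spark.read
      .format("cargoimp")        // assumed data source short name
      .load("/data/cargo-imp/")  // directory of raw Cargo-IMP messages

    // Parsed fields become queryable with plain Spark SQL.
    messages.createOrReplaceTempView("fwb_messages")
    spark.sql("SELECT * FROM fwb_messages LIMIT 10").show()
  }
}
```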

2025-09-04 · 1 min · rokorolev

Fantastic Spork

In real-world analytics, Spark users often need to count substrings, tally words in collections, or otherwise process text, tasks that are not always convenient with Spark’s built-in SQL functions. fantastic-spork delivers production-ready, native Catalyst expressions for these cases, ensuring top Spark performance and seamless integration:

- More efficient than regular Scala UDFs
- Convenient SQL extensions
- Composable across the DataFrame, Dataset, and SQL APIs

Project Link: https://rokorolev.gitlab.io/fantastic-spork/
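As a rough illustration of how such Catalyst-native helpers are typically wired in, here is a minimal sketch; the extensions class and the `substring_count` function name are assumptions for illustration, not names confirmed by the project:

```scala
// Hypothetical sketch: the extensions class and the SQL function name
// are assumptions about the library's surface, not confirmed names.
import org.apache.spark.sql.SparkSession

object FantasticSporkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fantastic-spork-example")
      .master("local[*]")
      // Catalyst extensions are normally injected through this standard Spark config.
      .config("spark.sql.extensions", "dev.rokorolev.spork.SporkExtensions") // assumed class
      .getOrCreate()

    // A native Catalyst expression avoids the serialization overhead of a Scala UDF.
    spark.sql("SELECT substring_count('spark sparkly spark', 'spark') AS n").show() // assumed function
  }
}
```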

2025-09-04 · 1 min · rokorolev

RedactifyAI

RedactifyAI is a Python package for detecting and anonymizing sensitive Personally Identifiable Information (PII) in textual data using Microsoft’s Presidio and Apache Spark.

Key Features:

- Integration with Presidio to detect and anonymize PII such as names, emails, phone numbers, and more.
- Spark-powered processing for scalable anonymization using PySpark.
- Custom recognizers to extend PII detection for specific needs.

Project Link: https://rokorolev.gitlab.io/redactify-ai/

2025-09-04 · 1 min · rokorolev

SafetyCultureToDatabricks

This project is a data integration tool that collects information from the SafetyCulture API v1.0 and inserts it into Databricks Delta tables. It is designed to automate the extraction, transformation, and loading (ETL) of SafetyCulture data for analytics and reporting in Databricks. Project Link: https://rokorolev.gitlab.io/safety-culture-to-databricks/
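As a minimal sketch of the extract-and-load pattern described (assuming a Spark runtime on Databricks; the endpoint path, token variable, and table name are placeholders, and the tool's actual implementation is not shown here):

```scala
// Hypothetical ETL sketch: endpoint, token, and table names are placeholders;
// this illustrates the pattern, not the tool's actual implementation.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.SparkSession

object SafetyCultureEtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("safetyculture-etl").getOrCreate()
    import spark.implicits._

    // Extract: fetch a JSON payload from a (placeholder) SafetyCulture endpoint.
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://api.safetyculture.io/audits/search")) // placeholder endpoint
      .header("Authorization", s"Bearer ${sys.env("SC_API_TOKEN")}")  // placeholder token source
      .GET()
      .build()
    val json = client.send(request, HttpResponse.BodyHandlers.ofString()).body()

    // Transform: let Spark infer a schema from the JSON response.
    val audits = spark.read.json(Seq(json).toDS())

    // Load: append into a Databricks Delta table for analytics and reporting.
    audits.write.format("delta").mode("append").saveAsTable("raw.safetyculture_audits")
  }
}
```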

2025-09-04 · 1 min · rokorolev

Sparqlin

sparqlin is a Spark SQL framework designed to simplify job creation and management in Databricks environments. It integrates with Spark SQL and PySpark for a streamlined development experience. The framework was created specifically to empower data analysts who may not have deep development skills: it offers a straightforward path to adopting standard software development life cycles, letting analysts focus on working with data without having to master complex programming paradigms. By leveraging familiar tools such as SQL scripts and YAML files, the framework simplifies tasks like data configuration, transformation, and testing. ...

2025-09-04 · 1 min · rokorolev

TableauUsageToDatabricks

TableauUsageToDatabricks is a .NET application designed to extract Tableau usage data and upload it to Databricks in a structured format. It parses Tableau XML and JSON files, transforms them into models, and writes the results as Parquet files for analytics and reporting in Databricks. Project Link: https://rokorolev.gitlab.io/tableau-usage-to-databricks/

2025-09-04 · 1 min · rokorolev

CarrierPerformanceReportsEtl

CarrierPerformanceReportsEtl is a production Spark/Scala data platform I architected and grew over ~4 years at WTG to ingest, evolve, and serve carrier and logistics performance analytics. I founded it as a solo engineer, then mentored rotating contributors while owning the roadmap, standards, and release quality (acting as de facto team and tech lead while titled Data Scientist / Senior Data Scientist).

1. Problem & Context

Logistics operations required timely, reliable KPIs (turnaround, message latency, carrier performance) sourced from heterogeneous semi-structured message streams and relational systems. Challenges: ...
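As a hedged illustration of the kind of KPI aggregation such a platform serves (the table and column names below are invented for the example and do not reflect the platform's real data model):

```scala
// Illustrative only: the event schema and table names are invented;
// they do not reflect the platform's actual data model.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CarrierKpiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("carrier-kpi").getOrCreate()

    val events = spark.read.table("silver.carrier_message_events") // hypothetical table

    // Example KPI: average message latency and message volume per carrier.
    val kpis = events
      .withColumn("latency_s",
        unix_timestamp(col("processed_at")) - unix_timestamp(col("sent_at")))
      .groupBy(col("carrier_code"))
      .agg(
        avg(col("latency_s")).alias("avg_latency_s"),
        count(lit(1)).alias("message_count"))

    kpis.write.format("delta").mode("overwrite").saveAsTable("gold.carrier_kpis")
  }
}
```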

2023-11-01 · 10 min · rokorolev