Data-Engineering

CargoIMP Spark Parser

cargoimp-spark-parser is a Scala library that adds native, high-performance parsing of IATA Cargo-IMP (International Air Transport Association Cargo Interchange Message Procedures) messages to Apache Spark jobs. It lets Spark users in air cargo, logistics, and data engineering work with Cargo-IMP message types such as FHL and FWB directly in Spark DataFrames and SQL queries.

Project Link: https://rokorolev.gitlab.io/cargoimp-spark-parser/

Data & Analytics Technology Timeline 2025

A compact, opinionated snapshot of the data & advanced analytics ecosystem as of 2025. For each domain: (a) first public emergence, (b) broad adoption / peak phase, (c) current 2025 posture.

Legend:
Emerging: gaining traction / rapid iteration
Active: healthy, broadly adopted, still improving
Mature Plateau: stable, incremental change; little green‑field excitement
Declining / Legacy: little new adoption; maintenance only

Core Data & Big Data Engines

| Technology | First Adoption | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| Hadoop (HDFS/MapReduce) | 2006 | 2009–2015 | Declining / Legacy | Replaced by cloud object storage + engines |
| YARN | 2012 | 2014–2018 | Mature Plateau | Still underpins legacy Hadoop clusters |
| Hive (SQL on Hadoop) | 2008 | 2012–2017 | Mature Plateau | Supplanted by Spark SQL / Trino in new builds |
| Pig | 2008 | 2011–2014 | Declining | Educational / migration only |
| HBase | 2008 | 2013–2017 | Niche Legacy | Large installs persist |
| Impala | 2012 | 2015–2018 | Stable / Niche | Limited net-new |
| Presto / Trino | 2013 / 2020 rename | 2016–2023 | Active | Federated SQL + lakehouse query |
| Parquet | 2013 | 2015–now | Dominant | De‑facto columnar format |
| ORC | 2013 | 2015–2018 | Mature Plateau | Hive-heavy shops |
| Apache Spark | 2009 (early), 2014 1.x | 2015–now | Active | Core batch + streaming engine |
| Ray | 2019 | 2022–now | Emerging / Active | Unified Python distributed workloads |
| Dask | 2016 | 2019–now | Active Niche | Python analytics / scientific |

Streaming & Event Processing

| Technology | First | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| Kafka | 2011 | 2015–now | Active | Backbone for streams + CDC |
| Kafka Connect | 2015 | 2018–now | Active | Connector ecosystem standard |
| Schema Registry | 2016 | 2018–now | Active | Avro/Protobuf evolution control |
| Flume | 2010 | 2013–2016 | Declining | Displaced by Kafka connectors |
| Storm | 2011 | 2013–2016 | Declining | Replaced by Flink / Kafka Streams |
| Flink | 2011 | 2018–now | Active | Event-time & low-latency streaming |
| Spark Streaming (DStreams) | 2012 | 2014–2018 | Legacy | Replaced by Structured Streaming |
| Structured Streaming | 2016 | 2018–now | Active | Unified micro-batch streaming |

Orchestration & Workflow

| Technology | First | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| Oozie | 2010 | 2012–2016 | Legacy | Legacy Hadoop workflow |
| Airflow | 2015 | 2018–now | Active | Batch / DAG orchestration leader |
| Control-M | 1990s | Long-running | Mature Plateau | Enterprise batch mainstay |

Lakehouse & Storage

| Technology / Concept | First | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| AWS S3 | 2006 | 2010–now | Active | Foundational object store |
| Azure Blob | 2008 | 2013–now | Active | Core Azure storage |
| GCS | 2010 | 2015–now | Active | Google cloud object storage |
| Databricks (platform) | 2013 | 2016–now | Expanding | Consolidated lakehouse workloads |
| Delta Lake | 2019 OSS | 2020–now | High Growth | ACID tables + time travel |
| “Lakehouse” Term | ~2020 | 2021–now | Mainstream | Marketing crystallized architecture |

Governance / Catalog / Lineage

| Technology | First | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| Apache Atlas | 2015 | 2017–2021 | Mature / Declining | Hadoop-centric |
| Amundsen | 2019 | 2020–2023 | Stable | Slower vs DataHub |
| DataHub | 2020 | 2022–now | Active / Growing | Modern metadata / lineage platform |

ML / MLOps / Modeling

| Technology / Concept | First | Peak Window | 2025 Status | Notes |
| --- | --- | --- | --- | --- |
| MLflow | 2018 | 2019–now | Active | Experiment + model registry |
| Feature Store Concept | ~2018 | 2020–now | Mature Emerging | Feast et al. normalization |
| PyTorch | 2016 | 2018–now | Active / Dominant | Research → production |
| TensorFlow | 2015 | 2016–2021 peak | Mature Plateau | Still strong, steadier share |
| PyMC | 2003 | 2015–now | Active | Bayesian workflows |
| NumPy / SciPy / Pandas | 2001 / 2006 / 2008 | 2012–now | Core | Foundational ecosystem |
| Ray (ML scaling) | 2019 | 2022–now | Growth | RL / distributed training |
| LLM / Vector DB Layering | 2022 | 2023–now | High Growth | Retrieval + augmentation patterns |

Academic reference: Longstaff–Schwartz (2001) → remains core for American-style options & real asset valuation. ...

CarrierPerformanceReportsEtl

CarrierPerformanceReportsEtl is a production Spark/Scala data platform I architected and grew over ~4 years at WTG to ingest, evolve, and serve carrier & logistics performance analytics. I founded it as a solo engineer, then mentored rotating contributors while owning roadmap, standards, and release quality (acting de‑facto team & tech lead while titled Data Scientist / Senior Data Scientist).

1. Problem & Context

Logistics operations required timely, reliable KPIs (turnaround, message latency, carrier performance) sourced from heterogeneous semi‑structured message streams and relational systems. Challenges: ...