This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.


1. Core Principles

  • On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
  • No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
  • Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
  • Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).

2. Data Ingestion

| Source Type | Examples | Mechanism | Notes |
| --- | --- | --- | --- |
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL, then approved to config tables | Manual approval workflow |
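
A minimal sketch of the batch file-drop pattern for LAS header extraction, assuming plain Python and hypothetical landing paths; production jobs also handled SEG-Y binaries and loaded results into Hive staging rather than printing them:

```python
# Illustrative only: scan the ~WELL section of dropped LAS files and pull
# mnemonic/value pairs as metadata. Paths and field handling are hypothetical.
import glob
import os

LANDING_DIR = "/data/raw/geoscience/incoming"   # hypothetical landing zone


def extract_header(las_path):
    """Pull mnemonic/value pairs from the ~WELL section of a LAS file."""
    header = {"source_file": os.path.basename(las_path)}
    in_well = False
    with open(las_path, "r", errors="ignore") as fh:
        for line in fh:
            if line.startswith("~"):
                in_well = line.upper().startswith("~W")
                continue
            if in_well and "." in line and ":" in line:
                mnemonic = line.split(".", 1)[0].strip()
                value = line.split(".", 1)[1].split(":", 1)[0].strip()
                header[mnemonic] = value   # unit + data kept raw in this sketch
    return header


if __name__ == "__main__":
    for path in glob.glob(os.path.join(LANDING_DIR, "*.las")):
        print(extract_header(path))  # production code wrote to a Hive staging table
```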

Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization.


3. Storage & Formats

  • Raw Zone: HDFS (gzip/original form) + limited S3 buckets (hybrid tests).
  • Curated Zone: Hive tables (Parquet/ORC), partitioned (date, asset, well_id).
  • Reservoir simulation outputs: Eclipse / CMG binary grids + summary CSV parsed into Hive fact tables.
  • Master data (wells, facilities, contracts): SQL Server / Oracle system-of-record, mirrored to Hive dims.
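
Illustrative Raw → Curated promotion step, assuming PySpark 2.x with Hive support; the database, table, and path names are placeholders:

```python
# Read a raw CSV drop from HDFS and persist it as a partitioned Parquet table
# in the curated zone, partitioned to match the common query predicates.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("curate_daily_production")
         .enableHiveSupport()
         .getOrCreate())

raw_path = "hdfs:///data/raw/production/2018-06-01/"   # hypothetical raw-zone drop
df = spark.read.option("header", "true").csv(raw_path)

(df.write
   .mode("overwrite")
   .format("parquet")
   .partitionBy("prod_date", "asset", "well_id")
   .saveAsTable("curated.daily_production"))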

4. Processing & Modeling Stack

| Layer | Tools / Tech | Purpose |
| --- | --- | --- |
| Batch Parallel | Spark 1.6 → 2.x | Monte Carlo sample generation, decline curve expansion, large joins |
| Statistical / Valuation | Python 2.7/3.5 (NumPy, SciPy, Pandas) | Economic metrics, real options calculations |
| Advanced Stats | R (forecast, variance decomposition) | Specialized fitting & diagnostic tests |
| HPC | SLURM / PBS / GridEngine | Reservoir realizations & large path batches |
| Options | Custom Python/R (binomial, Longstaff–Schwartz) | Option premium computations |
| Correlation Sampling | Python (Cholesky) | Reproducible factor draws |

Random seeds & correlation matrices persisted to control tables for reproducibility.
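
A minimal sketch of the reproducible correlated-draw pattern, assuming NumPy; the seed, factor names, and correlation values are illustrative stand-ins for what was read from the control tables:

```python
# Cholesky-based correlated sampling with a fixed, persisted seed.
import numpy as np

SEED = 20180601                      # persisted alongside the run for reproducibility
corr = np.array([[1.0, 0.6, 0.3],    # e.g. oil price, gas price, capex factor
                 [0.6, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])

rng = np.random.RandomState(SEED)
chol = np.linalg.cholesky(corr)                      # lower-triangular L with L.dot(L.T) == corr
independent = rng.standard_normal(size=(10000, 3))   # independent standard-normal draws
correlated = independent.dot(chol.T)                 # rows are correlated standard-normal draws

# Empirical check: sample correlation should approximate the target matrix.
print(np.round(np.corrcoef(correlated, rowvar=False), 2))
```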


5. Fiscal & Economic Engine

  • Core cash waterfall: SQL stored procedures (SQL Server), or a Python module where more flexibility was needed.
  • Scenario parameter schemas: price_deck, capex_curve, opex_class, tax_regime, option_trigger.
  • Run logging: run_log table referencing immutable parameter version rows.
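
A hedged sketch of the cash-waterfall shape those routines implemented (revenue → royalty → opex → capex → tax → net cash flow → NPV); rates and the discount convention below are placeholder assumptions:

```python
# Simplified per-period waterfall; no loss carry-forward or fiscal ring-fencing.
def period_net_cash_flow(volume_boe, price, royalty_rate, opex_per_boe, capex, tax_rate):
    revenue = volume_boe * price
    royalty = revenue * royalty_rate
    opex = volume_boe * opex_per_boe
    taxable = revenue - royalty - opex - capex
    tax = max(taxable, 0.0) * tax_rate
    return taxable - tax


def npv(cash_flows, discount_rate):
    # Mid-year or period-end conventions vary; period-end discounting shown here.
    return sum(cf / (1.0 + discount_rate) ** t for t, cf in enumerate(cash_flows, start=1))


cash_flows = [period_net_cash_flow(9.0e5, 55.0, 0.125, 8.0, 2.0e6, 0.30) for _ in range(10)]
print(round(npv(cash_flows, 0.10), 0))
```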

6. Parameter & Assumption Management

  • Version tables: reserves_version, price_deck_version, cost_curve_version.
  • Approval workflow: SharePoint / email sign-off toggling approved=Y flag.
  • Identity via composite keys (model_version + parameter_version_ids); minimal hashing at the time.
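
Illustrative composite-key construction for run identity; the hash suffix shown here reflects later practice rather than what was done at the time, and all field names are hypothetical:

```python
# Build a deterministic run identity from the model version plus the sorted
# list of approved parameter version IDs.
import hashlib


def run_identity(model_version, parameter_version_ids):
    key = "{}|{}".format(model_version, ",".join(sorted(parameter_version_ids)))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]   # optional short hash
    return key, digest


key, digest = run_identity("econ_v3.2", ["price_deck_v41", "cost_curve_v17", "reserves_v09"])
print(key, digest)
```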

7. Workflow Orchestration (Typical DAG)

  1. Extract latest approved parameters (SQL).
  2. Generate production forecasts (Spark decline expansion).
  3. Launch Monte Carlo batches (Spark jobs or HPC submissions).
  4. Aggregate distributions (Spark reduce / SQL summarize).
  5. Apply real options overlay (Python step).
  6. Persist summary percentiles (Hive + SQL Server).
  7. Notify stakeholders (email distribution).
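
A minimal Airflow 1.x-style sketch of this DAG; the task commands, schedule, and script names are hypothetical, and the real pipeline mixed Spark submits, HPC queue submissions, and SQL steps:

```python
# Nightly valuation DAG skeleton using the legacy Airflow 1.x operator imports.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="asset_valuation_nightly",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 1 * * *",   # overnight batch window
    catchup=False,
)

extract = BashOperator(task_id="extract_parameters",
                       bash_command="python extract_parameters.py", dag=dag)
forecast = BashOperator(task_id="decline_forecast",
                        bash_command="spark-submit decline_expand.py", dag=dag)
monte_carlo = BashOperator(task_id="monte_carlo",
                           bash_command="spark-submit mc_paths.py", dag=dag)
aggregate = BashOperator(task_id="aggregate_distributions",
                         bash_command="spark-submit aggregate.py", dag=dag)
options = BashOperator(task_id="real_options_overlay",
                       bash_command="python options_overlay.py", dag=dag)
persist = BashOperator(task_id="persist_percentiles",
                       bash_command="python persist_results.py", dag=dag)
notify = BashOperator(task_id="notify_stakeholders",
                      bash_command="python send_summary_email.py", dag=dag)

extract >> forecast >> monte_carlo >> aggregate >> options >> persist >> notify
```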

8. Reporting & Analytics

  • Executive BI: Spotfire / Tableau / early Power BI (NPV P10/P50/P90, tornado drivers).
  • Ad hoc: Excel pivots via ODBC to curated SQL views.
  • Storage optimization: Percentiles persisted; optional compressed binary full-path arrays (NFS) for audit replay.
  • Delta comparisons: SQL view diffing latest run vs prior approved baseline.
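
A sketch of the latest-run vs approved-baseline delta comparison, done here in pandas over exported views; the file/view names and join keys are placeholders:

```python
# Compare percentile summaries between the latest run and the approved baseline.
import pandas as pd

latest = pd.read_csv("latest_run_percentiles.csv")          # stand-in for the SQL view
baseline = pd.read_csv("approved_baseline_percentiles.csv")

delta = latest.merge(baseline, on=["asset", "metric"], suffixes=("_new", "_base"))
for p in ("p10", "p50", "p90"):
    delta[p + "_delta"] = delta[p + "_new"] - delta[p + "_base"]

print(delta[["asset", "metric", "p50_delta"]].sort_values("p50_delta"))
```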

9. Integration with Business Units

| Domain | Inbound Flow | Processing | Output Target |
| --- | --- | --- | --- |
| Geology | Petrel / Eclipse exports | ETL parse | Hive reservoir summaries |
| Drilling | Daily reports (PDF → manual XLS) | Structured capture | SQL + Hive staging |
| Production Ops | Hourly SCADA snapshot | Kafka / batch ingest | Event log / downtime model |
| Marketing | Monthly price deck Excel | Controlled ETL & approval | price_deck_version |
| Finance | ERP CAPEX/OPEX nightly | Sqoop import | Cost tables / allocation feeds |
| Risk | Correlation matrix | Import to risk_correlation module | Simulation inputs |
| Land / Infrastructure | Tariff & capacity schedules | Direct table loads | Constraint models |

10. Security & Access

  • Hadoop: Kerberos where cluster maturity allowed; otherwise OS-level ACLs.
  • SQL Server roles: FINANCE_READ, GEO_WRITE, ECON_ADMIN.
  • File share oversight: Manual audits (Windows DFS / NFS export lists).

11. Governance & Data Quality

  • Python DQ scripts logging exceptions to dq_issue (rule_id, entity, severity, timestamp).
  • Monthly DQ review & remediation tracking (manual register).
  • Lineage approximation: ETL job logs + timestamped folder patterns (no full automated lineage tooling yet).
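
A minimal sketch of the DQ-script pattern logging exceptions to dq_issue, using sqlite3 as a stand-in for the actual SQL Server / Hive connections; the rule, rule ID, and column names are illustrative:

```python
# Evaluate a data-quality rule and log each exception to the dq_issue table.
import datetime
import sqlite3   # stand-in connection for illustration only


def log_dq_issue(conn, rule_id, entity, severity):
    conn.execute(
        "INSERT INTO dq_issue (rule_id, entity, severity, ts) VALUES (?, ?, ?, ?)",
        (rule_id, entity, severity, datetime.datetime.utcnow().isoformat()),
    )


def check_negative_rates(conn):
    """Hypothetical rule DQ-017: daily oil rate must be non-negative."""
    rows = conn.execute(
        "SELECT well_id FROM daily_production WHERE oil_rate < 0"
    ).fetchall()
    for (well_id,) in rows:
        log_dq_issue(conn, "DQ-017", "well:{}".format(well_id), "HIGH")
```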

12. Carbon / ESG (Early Stage)

  • Emission factor reference table (fuel gas, flaring, venting).
  • Static carbon price scenarios (low / base / high) feeding OPEX line.
  • No dynamic abatement / marginal cost optimization at that stage.
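
Illustrative carbon-cost line item (emission factor × activity × static scenario price, added to OPEX); the factor values and scenario prices below are placeholders, not the reference-table contents:

```python
# Static-scenario carbon cost feeding the OPEX line.
EMISSION_FACTORS = {"fuel_gas": 2.75, "flaring": 3.20, "venting": 25.0}  # t CO2e per activity unit (placeholder)
CARBON_PRICE = {"low": 10.0, "base": 25.0, "high": 50.0}                 # $ per t CO2e (placeholder)


def carbon_opex(activity, scenario="base"):
    """activity: dict of emission source -> activity volume in the factor's unit basis."""
    tonnes = sum(vol * EMISSION_FACTORS[src] for src, vol in activity.items())
    return tonnes * CARBON_PRICE[scenario]


print(carbon_opex({"fuel_gas": 1200.0, "flaring": 85.0}, scenario="high"))
```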

13. Real Options Implementation Constraints

  • Limited compute elasticity → overnight batch window; option valuations ran weekly rather than continuously.
  • Focused scope: delay sanction, phased expansion, early abandonment (highest value density).
  • Longstaff–Schwartz regression on a reduced path sample (≈2k paths) to conserve resources.
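
A sketch of the Longstaff–Schwartz least-squares regression on a reduced path sample, shown here as a generic American put on simulated prices; the actual engine valued delay/expansion/abandonment flexibility on project cash flows, but the regression mechanics are the same:

```python
# Longstaff–Schwartz on ~2k GBM paths: backward induction with a quadratic
# basis regression of the continuation value on in-the-money paths.
import numpy as np

rng = np.random.RandomState(42)
n_paths, n_steps, dt = 2000, 50, 0.02              # ~2k paths, 1-year horizon
s0, strike, r, sigma = 60.0, 60.0, 0.03, 0.35      # placeholder parameters

z = rng.standard_normal((n_paths, n_steps))
log_paths = np.cumsum((r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z, axis=1)
prices = s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

cash_flow = np.maximum(strike - prices[:, -1], 0.0)      # terminal payoff
for t in range(n_steps - 1, 0, -1):
    cash_flow *= np.exp(-r * dt)                         # discount one step back to t
    itm = strike - prices[:, t] > 0.0
    if itm.sum() > 3:
        x = prices[itm, t]
        basis = np.column_stack([np.ones_like(x), x, x ** 2])
        coef, *_ = np.linalg.lstsq(basis, cash_flow[itm], rcond=None)
        continuation = basis.dot(coef)
        exercise = strike - x
        exercise_now = exercise > continuation
        cash_flow[np.where(itm)[0][exercise_now]] = exercise[exercise_now]

option_value = np.exp(-r * dt) * cash_flow.mean()
print(round(option_value, 3))
```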

14. Performance Practices

  • Pre-aggregate well groups to reduce simulation dimensionality.
  • Spark broadcast for small dimension joins.
  • Persist intermediate Monte Carlo artifacts (sampled price paths) for reuse.
  • Convergence guardrail: escalate run count only when ENPV band instability > tolerance.
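
A sketch of that convergence guardrail: compare the ENPV P10–P90 band across two half-samples and escalate the run count only when the disagreement exceeds a tolerance; the split logic and threshold are illustrative assumptions:

```python
# Half-sample band stability check used to decide whether more runs are needed.
import numpy as np


def band_instability(enpv_samples):
    half = len(enpv_samples) // 2
    a, b = enpv_samples[:half], enpv_samples[half:]
    band_a = np.percentile(a, 90) - np.percentile(a, 10)
    band_b = np.percentile(b, 90) - np.percentile(b, 10)
    return abs(band_a - band_b) / max(abs(band_a), 1e-9)


def needs_more_runs(enpv_samples, tolerance=0.05):
    return band_instability(np.asarray(enpv_samples)) > tolerance
```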

15. Monitoring & Logging

  • Scheduler logs (Oozie) + run_log status, duration, record counts.
  • Simple email alerts on failure; ELK used where a proof of concept had been adopted, but not standardized.

16. Migration Path / Future Evolution

| Evolution Step | Rationale | Outcome Target |
| --- | --- | --- |
| Broader Airflow adoption | Transparency & modular DAG management | Replace opaque XML workflows |
| MLflow / experiment tracking | Trace model / parameter evolution | Reproducibility maturity |
| Object store full-path persistence | Richer post-hoc analytics | Scenario drill-down, re-simulation |
| Metadata / lineage catalog (DataHub etc.) | Governance & impact analysis | Faster change assessment |
| Containerization & elasticity | Reduce queue latency | On-demand scaling |

17. Architectural Summary Diagram

```mermaid
flowchart TD
  Sources["Sources"]
  Raw["Raw Zone (HDFS)"]
  Curated["Curated (Hive)"]
  Marts["Marts (Impala/Presto)"]
  Reservoir["Reservoir Grids"]
  Engines["Economic / Option Engines (Py/R)"]
  Outputs["Simulation Outputs (Hive + SQL)"]
  BI["Reporting / BI"]

  Sources --> Raw --> Curated --> Marts
  Raw --> Reservoir
  Curated --> Engines
  Engines --> Outputs --> BI
  Reservoir --> Outputs
```

18. Key Constraints vs Modern Design

| 2015–2018 Limitation | Modern Remedy |
| --- | --- |
| Limited elasticity (static cluster) | Serverless / autoscaling compute (Spark on k8s, Databricks) |
| Manual catalog & lineage | Active metadata graph (DataHub, OpenLineage) |
| Weekly options cycle | Event-driven incremental valuation |
| Percentiles-only storage | Columnar archival of full paths (Parquet / Delta) |
| Ad hoc approval workflow | Policy-driven parameter governance platform |

Historical architecture reference; complements the upstream evaluation modeling framework.