This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.


1. Core Principles

  • On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
  • No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
  • Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
  • Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).

2. Data Ingestion

| Source Type | Examples | Mechanism | Notes |
| --- | --- | --- | --- |
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL, then approved to config tables | Manual approval workflow |
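
A minimal sketch of the batch file-drop pattern for LAS header extraction, assuming plain Python and hypothetical landing paths; production jobs also handled SEG-Y binaries and loaded results into Hive staging rather than printing them:

```python
# Illustrative only: scan the ~WELL section of dropped LAS files and pull
# mnemonic/value pairs as metadata. Paths and field handling are hypothetical.
import glob
import os

LANDING_DIR = "/data/raw/geoscience/incoming"   # hypothetical landing zone


def extract_header(las_path):
    """Pull mnemonic/value pairs from the ~WELL section of a LAS file."""
    header = {"source_file": os.path.basename(las_path)}
    in_well = False
    with open(las_path, "r", errors="ignore") as fh:
        for line in fh:
            if line.startswith("~"):
                in_well = line.upper().startswith("~W")
                continue
            if in_well and "." in line and ":" in line:
                mnemonic = line.split(".", 1)[0].strip()
                value = line.split(".", 1)[1].split(":", 1)[0].strip()
                header[mnemonic] = value   # unit + data kept raw in this sketch
    return header


if __name__ == "__main__":
    for path in glob.glob(os.path.join(LANDING_DIR, "*.las")):
        print(extract_header(path))  # production code wrote to a Hive staging table
```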

Workflow orchestration: Oozie XML workflows early; selective Airflow DAGs (late 2017–2018) for transparency and dependency visualization.


3. Storage & Formats

  • Raw Zone: HDFS (gzip/original form) + limited S3 buckets (hybrid tests).
  • Curated Zone: Hive tables (Parquet/ORC), partitioned (date, asset, well_id).
  • Reservoir simulation outputs: Eclipse / CMG binary grids + summary CSV parsed into Hive fact tables.
  • Master data (wells, facilities, contracts): SQL Server / Oracle system-of-record, mirrored to Hive dims.
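
Illustrative Raw → Curated promotion step, assuming PySpark 2.x with Hive support; the database, table, and path names are placeholders:

```python
# Read a raw CSV drop from HDFS and persist it as a partitioned Parquet table
# in the curated zone, partitioned to match the common query predicates.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("curate_daily_production")
         .enableHiveSupport()
         .getOrCreate())

raw_path = "hdfs:///data/raw/production/2018-06-01/"   # hypothetical raw-zone drop
df = spark.read.option("header", "true").csv(raw_path)

(df.write
   .mode("overwrite")
   .format("parquet")
   .partitionBy("prod_date", "asset", "well_id")
   .saveAsTable("curated.daily_production"))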

4. Processing & Modeling Stack

| Layer | Tools / Tech | Purpose |
| --- | --- | --- |
| Batch Parallel | Spark 1.6 → 2.x | Monte Carlo sample generation, decline curve expansion, large joins |
| Statistical / Valuation | Python 2.7/3.5 (NumPy, SciPy, Pandas) | Economic metrics, real options calculations |
| Advanced Stats | R (forecast, variance decomposition) | Specialized fitting & diagnostic tests |
| HPC | SLURM / PBS / GridEngine | Reservoir realizations & large path batches |
| Options | Custom Python/R (binomial, Longstaff–Schwartz) | Option premium computations |
| Correlation Sampling | Python (Cholesky) | Reproducible factor draws |

Random seeds & correlation matrices persisted to control tables for reproducibility.
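
A minimal sketch of the reproducible correlated-draw pattern, assuming NumPy; the seed, factor names, and correlation values are illustrative stand-ins for what was read from the control tables:

```python
# Cholesky-based correlated sampling with a fixed, persisted seed.
import numpy as np

SEED = 20180601                      # persisted alongside the run for reproducibility
corr = np.array([[1.0, 0.6, 0.3],    # e.g. oil price, gas price, capex factor
                 [0.6, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])

rng = np.random.RandomState(SEED)
chol = np.linalg.cholesky(corr)                      # lower-triangular L with L.dot(L.T) == corr
independent = rng.standard_normal(size=(10000, 3))   # independent standard-normal draws
correlated = independent.dot(chol.T)                 # rows are correlated standard-normal draws

# Empirical check: sample correlation should approximate the target matrix.
print(np.round(np.corrcoef(correlated, rowvar=False), 2))
```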


5. Fiscal & Economic Engine

  • Core cash waterfall: SQL stored procedures (SQL Server), or a Python module where more flexibility was needed.
  • Scenario parameter schemas: price_deck, capex_curve, opex_class, tax_regime, option_trigger.
  • Run logging: run_log table referencing immutable parameter version rows.
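
A hedged sketch of the cash-waterfall shape those routines implemented (revenue → royalty → opex → capex → tax → net cash flow → NPV); rates and the discount convention below are placeholder assumptions:

```python
# Simplified per-period waterfall; no loss carry-forward or fiscal ring-fencing.
def period_net_cash_flow(volume_boe, price, royalty_rate, opex_per_boe, capex, tax_rate):
    revenue = volume_boe * price
    royalty = revenue * royalty_rate
    opex = volume_boe * opex_per_boe
    taxable = revenue - royalty - opex - capex
    tax = max(taxable, 0.0) * tax_rate
    return taxable - tax


def npv(cash_flows, discount_rate):
    # Mid-year or period-end conventions vary; period-end discounting shown here.
    return sum(cf / (1.0 + discount_rate) ** t for t, cf in enumerate(cash_flows, start=1))


cash_flows = [period_net_cash_flow(9.0e5, 55.0, 0.125, 8.0, 2.0e6, 0.30) for _ in range(10)]
print(round(npv(cash_flows, 0.10), 0))
```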

6. Parameter & Assumption Management

  • Version tables: reserves_version, price_deck_version, cost_curve_version.
  • Approval workflow: SharePoint / email sign-off toggling approved=Y flag.
  • Identity via composite keys (model_version + parameter_version_ids); minimal hashing at the time.
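
Illustrative composite-key construction for run identity; the hash suffix shown here reflects later practice rather than what was done at the time, and all field names are hypothetical:

```python
# Build a deterministic run identity from the model version plus the sorted
# list of approved parameter version IDs.
import hashlib


def run_identity(model_version, parameter_version_ids):
    key = "{}|{}".format(model_version, ",".join(sorted(parameter_version_ids)))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]   # optional short hash
    return key, digest


key, digest = run_identity("econ_v3.2", ["price_deck_v41", "cost_curve_v17", "reserves_v09"])
print(key, digest)
```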

7. Workflow Orchestration (Typical DAG)

  1. Extract latest approved parameters (SQL).
  2. Generate production forecasts (Spark decline expansion).
  3. Launch Monte Carlo batches (Spark jobs or HPC submissions).
  4. Aggregate distributions (Spark reduce / SQL summarize).
  5. Apply real options overlay (Python step).
  6. Persist summary percentiles (Hive + SQL Server).
  7. Notify stakeholders (email distribution).
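
A minimal Airflow 1.x-style sketch of this DAG; the task commands, schedule, and script names are hypothetical, and the real pipeline mixed Spark submits, HPC queue submissions, and SQL steps:

```python
# Nightly valuation DAG skeleton using the legacy Airflow 1.x operator imports.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="asset_valuation_nightly",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 1 * * *",   # overnight batch window
    catchup=False,
)

extract = BashOperator(task_id="extract_parameters",
                       bash_command="python extract_parameters.py", dag=dag)
forecast = BashOperator(task_id="decline_forecast",
                        bash_command="spark-submit decline_expand.py", dag=dag)
monte_carlo = BashOperator(task_id="monte_carlo",
                           bash_command="spark-submit mc_paths.py", dag=dag)
aggregate = BashOperator(task_id="aggregate_distributions",
                         bash_command="spark-submit aggregate.py", dag=dag)
options = BashOperator(task_id="real_options_overlay",
                       bash_command="python options_overlay.py", dag=dag)
persist = BashOperator(task_id="persist_percentiles",
                       bash_command="python persist_results.py", dag=dag)
notify = BashOperator(task_id="notify_stakeholders",
                      bash_command="python send_summary_email.py", dag=dag)

extract >> forecast >> monte_carlo >> aggregate >> options >> persist >> notify
```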

8. Reporting & Analytics

  • Executive BI: Spotfire / Tableau / early Power BI (NPV P10/P50/P90, tornado drivers).
  • Ad hoc: Excel pivots via ODBC to curated SQL views.
  • Storage optimization: Percentiles persisted; optional compressed binary full-path arrays (NFS) for audit replay.
  • Delta comparisons: SQL view diffing latest run vs prior approved baseline.
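
A sketch of the latest-run vs approved-baseline delta comparison, done here in pandas over exported views; the file/view names and join keys are placeholders:

```python
# Compare percentile summaries between the latest run and the approved baseline.
import pandas as pd

latest = pd.read_csv("latest_run_percentiles.csv")          # stand-in for the SQL view
baseline = pd.read_csv("approved_baseline_percentiles.csv")

delta = latest.merge(baseline, on=["asset", "metric"], suffixes=("_new", "_base"))
for p in ("p10", "p50", "p90"):
    delta[p + "_delta"] = delta[p + "_new"] - delta[p + "_base"]

print(delta[["asset", "metric", "p50_delta"]].sort_values("p50_delta"))
```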

9. Integration with Business Units

| Domain | Inbound Flow | Processing | Output Target |
| --- | --- | --- | --- |
| Geology | Petrel / Eclipse exports | ETL parse | Hive reservoir summaries |
| Drilling | Daily reports (PDF → manual XLS) | Structured capture | SQL + Hive staging |
| Production Ops | Hourly SCADA snapshot | Kafka / batch ingest | Event log / downtime model |
| Marketing | Monthly price deck Excel | Controlled ETL & approval | price_deck_version |
| Finance | ERP CAPEX/OPEX nightly | Sqoop import | Cost tables / allocation feeds |
| Risk | Correlation matrix | Import to risk_correlation module | Simulation inputs |
| Land / Infrastructure | Tariff & capacity schedules | Direct table loads | Constraint models |

10. Security & Access

  • Hadoop: Kerberos where cluster maturity allowed; otherwise OS-level ACLs.
  • SQL Server roles: FINANCE_READ, GEO_WRITE, ECON_ADMIN.
  • File share oversight: Manual audits (Windows DFS / NFS export lists).

11. Governance & Data Quality

  • Python DQ scripts logging exceptions to dq_issue (rule_id, entity, severity, timestamp).
  • Monthly DQ review & remediation tracking (manual register).
  • Lineage approximation: ETL job logs + timestamped folder patterns (no full automated lineage tooling yet).
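
A minimal sketch of the DQ-script pattern logging exceptions to dq_issue, using sqlite3 as a stand-in for the actual SQL Server / Hive connections; the rule, rule ID, and column names are illustrative:

```python
# Evaluate a data-quality rule and log each exception to the dq_issue table.
import datetime
import sqlite3   # stand-in connection for illustration only


def log_dq_issue(conn, rule_id, entity, severity):
    conn.execute(
        "INSERT INTO dq_issue (rule_id, entity, severity, ts) VALUES (?, ?, ?, ?)",
        (rule_id, entity, severity, datetime.datetime.utcnow().isoformat()),
    )


def check_negative_rates(conn):
    """Hypothetical rule DQ-017: daily oil rate must be non-negative."""
    rows = conn.execute(
        "SELECT well_id FROM daily_production WHERE oil_rate < 0"
    ).fetchall()
    for (well_id,) in rows:
        log_dq_issue(conn, "DQ-017", "well:{}".format(well_id), "HIGH")
```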

12. Carbon / ESG (Early Stage)

  • Emission factor reference table (fuel gas, flaring, venting).
  • Static carbon price scenarios (low / base / high) feeding OPEX line.
  • No dynamic abatement / marginal cost optimization at that stage.
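
Illustrative carbon-cost line item (emission factor × activity × static scenario price, added to OPEX); the factor values and scenario prices below are placeholders, not the reference-table contents:

```python
# Static-scenario carbon cost feeding the OPEX line.
EMISSION_FACTORS = {"fuel_gas": 2.75, "flaring": 3.20, "venting": 25.0}  # t CO2e per activity unit (placeholder)
CARBON_PRICE = {"low": 10.0, "base": 25.0, "high": 50.0}                 # $ per t CO2e (placeholder)


def carbon_opex(activity, scenario="base"):
    """activity: dict of emission source -> activity volume in the factor's unit basis."""
    tonnes = sum(vol * EMISSION_FACTORS[src] for src, vol in activity.items())
    return tonnes * CARBON_PRICE[scenario]


print(carbon_opex({"fuel_gas": 1200.0, "flaring": 85.0}, scenario="high"))
```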

13. Real Options Implementation Constraints

  • Limited compute elasticity → overnight batch window; option valuations ran weekly rather than continuously.
  • Focused scope: delay sanction, phased expansion, early abandonment (highest value density).
  • Longstaff–Schwartz regression on a reduced path sample (≈2k paths) to conserve resources.
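
A sketch of the Longstaff–Schwartz least-squares regression on a reduced path sample, shown here as a generic American put on simulated prices; the actual engine valued delay/expansion/abandonment flexibility on project cash flows, but the regression mechanics are the same:

```python
# Longstaff–Schwartz on ~2k GBM paths: backward induction with a quadratic
# basis regression of the continuation value on in-the-money paths.
import numpy as np

rng = np.random.RandomState(42)
n_paths, n_steps, dt = 2000, 50, 0.02              # ~2k paths, 1-year horizon
s0, strike, r, sigma = 60.0, 60.0, 0.03, 0.35      # placeholder parameters

z = rng.standard_normal((n_paths, n_steps))
log_paths = np.cumsum((r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z, axis=1)
prices = s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

cash_flow = np.maximum(strike - prices[:, -1], 0.0)      # terminal payoff
for t in range(n_steps - 1, 0, -1):
    cash_flow *= np.exp(-r * dt)                         # discount one step back to t
    itm = strike - prices[:, t] > 0.0
    if itm.sum() > 3:
        x = prices[itm, t]
        basis = np.column_stack([np.ones_like(x), x, x ** 2])
        coef, *_ = np.linalg.lstsq(basis, cash_flow[itm], rcond=None)
        continuation = basis.dot(coef)
        exercise = strike - x
        exercise_now = exercise > continuation
        cash_flow[np.where(itm)[0][exercise_now]] = exercise[exercise_now]

option_value = np.exp(-r * dt) * cash_flow.mean()
print(round(option_value, 3))
```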

14. Performance Practices

  • Pre-aggregate well groups to reduce simulation dimensionality.
  • Spark broadcast for small dimension joins.
  • Persist intermediate Monte Carlo artifacts (sampled price paths) for reuse.
  • Convergence guardrail: escalate run count only when ENPV band instability > tolerance.
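
A sketch of that convergence guardrail: compare the ENPV P10–P90 band across two half-samples and escalate the run count only when the disagreement exceeds a tolerance; the split logic and threshold are illustrative assumptions:

```python
# Half-sample band stability check used to decide whether more runs are needed.
import numpy as np


def band_instability(enpv_samples):
    half = len(enpv_samples) // 2
    a, b = enpv_samples[:half], enpv_samples[half:]
    band_a = np.percentile(a, 90) - np.percentile(a, 10)
    band_b = np.percentile(b, 90) - np.percentile(b, 10)
    return abs(band_a - band_b) / max(abs(band_a), 1e-9)


def needs_more_runs(enpv_samples, tolerance=0.05):
    return band_instability(np.asarray(enpv_samples)) > tolerance
```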

15. Monitoring & Logging

  • Scheduler logs (Oozie) + run_log status, duration, record counts.
  • Simple email alerts on failure; ELK used where a proof of concept had been adopted, but not standardized.

16. Migration Path / Future Evolution

| Evolution Step | Rationale | Outcome Target |
| --- | --- | --- |
| Broader Airflow adoption | Transparency & modular DAG management | Replace opaque XML workflows |
| MLflow / experiment tracking | Trace model / parameter evolution | Reproducibility maturity |
| Object store full-path persistence | Richer post-hoc analytics | Scenario drill-down, re-simulation |
| Metadata / lineage catalog (DataHub etc.) | Governance & impact analysis | Faster change assessment |
| Containerization & elasticity | Reduce queue latency | On-demand scaling |

17. Architectural Summary Diagram

```mermaid
flowchart TD
  Sources["Sources"]
  Raw["Raw Zone (HDFS)"]
  Curated["Curated (Hive)"]
  Marts["Marts (Impala/Presto)"]
  Reservoir["Reservoir Grids"]
  Engines["Economic / Option Engines (Py/R)"]
  Outputs["Simulation Outputs (Hive + SQL)"]
  BI["Reporting / BI"]

  Sources --> Raw --> Curated --> Marts
  Raw --> Reservoir
  Curated --> Engines
  Engines --> Outputs --> BI
  Reservoir --> Outputs
```

18. Key Constraints vs Modern Design

| 2015–2018 Limitation | Modern Remedy |
| --- | --- |
| Limited elasticity (static cluster) | Serverless / autoscaling compute (Spark on k8s, Databricks) |
| Manual catalog & lineage | Active metadata graph (DataHub, OpenLineage) |
| Weekly options cycle | Event-driven incremental valuation |
| Percentiles-only storage | Columnar archival of full paths (Parquet / Delta) |
| Ad hoc approval workflow | Policy-driven parameter governance platform |

Historical architecture reference; complements the upstream evaluation modeling framework.