This post captures the practical architecture (2015–2018) that supported the upstream evaluation & stochastic modeling framework outlined in the related post: Upstream Asset Evaluation Framework. It maps legacy design choices to modern terminology and highlights constraints that shaped modeling workflows.
## 1. Core Principles
- On‑prem / hybrid HPC + Hadoop (YARN) cluster for heavy simulation; limited early cloud (select AWS EC2/EMR/S3; occasional Azure VM/Blob).
- No unified “lakehouse” yet: layered zones → Raw (HDFS/S3) → Curated (Hive/Parquet/ORC) → Marts (Hive/Impala/Presto).
- Limited containers/Kubernetes; batch schedulers dominated (Oozie, early Airflow pilot, Control‑M, Cron).
- Governance largely manual: Hive Metastore + ad hoc catalog (Excel / SharePoint / SQL).
## 2. Data Ingestion
| Source Type | Examples | Mechanism | Notes |
|---|---|---|---|
| Geoscience | LAS, SEG-Y | Batch file drop + ETL parse | Large binary + metadata extraction |
| Well / Ops | WITSML feeds | Batch pull / scheduled parse | Standardization step into Hive |
| ERP / Finance | CSV / RDBMS exports | Sqoop (RDBMS→HDFS), SSIS, Python/.NET ETL | Controlled nightly cadence |
| SCADA / Events | Downtime logs | Kafka 0.8/0.9 (where deployed) or Flume/Logstash | Early streaming footprint |
| Market / Pricing | Excel price decks | Staged in SQL then approved to config tables | Manual approval workflow |
Workflow orchestration: Oozie XML workflows early on; selective Airflow DAGs adopted (late 2017–2018) for transparency and dependency visualization. A minimal DAG sketch follows.
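For illustration, a minimal nightly-ingest DAG in the classic (pre-2.0) Airflow API, chaining a Sqoop pull from the ERP RDBMS into a Hive load. The host, table, and path names here are hypothetical, not the production values:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # classic 1.x import

default_args = {
    "owner": "econ_eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=15),
}

# Nightly ingest: Sqoop pull from the ERP RDBMS into the HDFS raw zone,
# then a Hive load into the curated zone. All identifiers are illustrative.
with DAG("erp_nightly_ingest",
         default_args=default_args,
         start_date=datetime(2018, 1, 1),
         schedule_interval="0 2 * * *") as dag:

    sqoop_pull = BashOperator(
        task_id="sqoop_erp_costs",
        bash_command=(
            "sqoop import --connect jdbc:oracle:thin:@erp-host:1521/PROD "
            "--table COST_LINES --target-dir /raw/erp/cost_lines/{{ ds }} "
            "--num-mappers 4"
        ),
    )

    hive_load = BashOperator(
        task_id="load_curated_costs",
        bash_command="hive -f /etl/hql/load_cost_lines.hql -hivevar ds={{ ds }}",
    )

    sqoop_pull >> hive_load
```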
## 3. Storage Zones

- Raw Zone: HDFS (gzip/original form) + limited S3 buckets (hybrid tests).
- Curated Zone: Hive tables (Parquet/ORC), partitioned by (date, asset, well_id); a write sketch follows this list.
- Reservoir simulation outputs: Eclipse / CMG binary grids + summary CSV parsed into Hive fact tables.
- Master data (wells, facilities, contracts): SQL Server / Oracle system-of-record, mirrored to Hive dims.
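A sketch of the curated-zone write pattern in PySpark (Spark 2.x), assuming a raw CSV drop; the paths, table, and partition column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("curate_production")   # Spark 2.x unified entry point
         .enableHiveSupport()
         .getOrCreate())

# Parse a raw daily production drop and land it as a partitioned,
# columnar Hive table in the curated zone.
raw = spark.read.option("header", "true").csv("/raw/production/2018-06-01/")

(raw.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("prod_date", "asset", "well_id")
    .saveAsTable("curated.production_daily"))
```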
## 4. Processing & Modeling Stack
| Layer | Tools / Tech | Purpose |
|---|---|---|
| Batch Parallel | Spark 1.6 → 2.x | Monte Carlo sample gen, decline curve expansion, large joins |
| Statistical / Valuation | Python 2.7/3.5 (NumPy, SciPy, Pandas) | Economic metrics, real options calc |
| Advanced Stats | R (forecast, variance decomposition) | Specialized fitting & diagnostic tests |
| HPC | SLURM / PBS / GridEngine | Reservoir realizations & large path batches |
| Options | Custom Python/R (binomial, Longstaff–Schwartz) | Option premium computations |
| Correlation Sampling | Python (Cholesky) | Reproducible factor draws |
Random seeds and correlation matrices were persisted to control tables so factor draws could be reproduced exactly.
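A minimal sketch of the Cholesky-based correlated sampling with a fixed seed for reproducibility; the factor set and correlation values below are illustrative:

```python
import numpy as np

def correlated_draws(corr, n_samples, seed):
    """Draw standard-normal factor samples with a target correlation
    structure via Cholesky decomposition (reproducible by seed)."""
    rng = np.random.RandomState(seed)   # era-appropriate RNG API
    L = np.linalg.cholesky(corr)        # corr must be positive definite
    z = rng.standard_normal((n_samples, corr.shape[0]))
    return z @ L.T                      # rows are correlated draws

# Example: oil price, gas price, capex factors with assumed correlations.
corr = np.array([[1.0, 0.6, 0.2],
                 [0.6, 1.0, 0.1],
                 [0.2, 0.1, 1.0]])
draws = correlated_draws(corr, n_samples=10_000, seed=42)
print(np.corrcoef(draws, rowvar=False).round(2))  # recovers ~corr
```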
## 5. Fiscal & Economic Engine

- Core cash waterfall: SQL stored procedures (SQL Server), or a Python module where more flexibility was needed (a minimal sketch follows this list).
- Scenario parameter schemas: `price_deck`, `capex_curve`, `opex_class`, `tax_regime`, `option_trigger`.
- Run logging: `run_log` table referencing immutable parameter version rows.
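A toy Python version of the cash waterfall, reduced to revenue → royalty → tax → net cash flow → NPV. The fiscal terms are deliberately simplified (no loss carry-forward, single tax line) and all figures are hypothetical:

```python
import numpy as np

def cash_waterfall(price, volume, royalty_rate, opex, capex, tax_rate, disc_rate):
    """Single-scenario cash waterfall. All per-period inputs are
    NumPy arrays of equal length; returns NPV."""
    revenue = price * volume
    royalty = revenue * royalty_rate
    taxable = revenue - royalty - opex           # simplified fiscal base
    tax = np.maximum(taxable, 0.0) * tax_rate    # no loss carry-forward here
    ncf = taxable - tax - capex
    t = np.arange(1, len(ncf) + 1)
    return float(np.sum(ncf / (1 + disc_rate) ** t))

# Illustrative 5-year toy case (all figures hypothetical).
npv = cash_waterfall(
    price=np.array([55, 58, 60, 60, 62], dtype=float),
    volume=np.array([900, 800, 700, 600, 500], dtype=float),
    royalty_rate=0.125,
    opex=np.full(5, 8_000.0),
    capex=np.array([25_000, 5_000, 0, 0, 0], dtype=float),
    tax_rate=0.30,
    disc_rate=0.10,
)
```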
## 6. Parameter & Assumption Management

- Version tables: `reserves_version`, `price_deck_version`, `cost_curve_version`.
- Approval workflow: SharePoint / email sign-off toggling an `approved=Y` flag (a resolution sketch follows this list).
- Identity via composite keys (`model_version` + parameter version IDs); minimal hashing at the time.
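A hypothetical lookup that resolves the latest approved version per parameter set and assembles the composite run identity; the DSN, table, and column names are illustrative stand-ins, not the actual schema:

```python
import pandas as pd
import pyodbc  # era-appropriate ODBC access to SQL Server

# Resolve the latest approved version of each parameter set.
conn = pyodbc.connect("DSN=econdb;Trusted_Connection=yes")

latest = pd.read_sql(
    """
    SELECT parameter_set, MAX(version_id) AS version_id
    FROM parameter_version
    WHERE approved = 'Y'
    GROUP BY parameter_set
    """,
    conn,
)

# Composite-key run identity (no hashing, per the era's practice).
run_identity = "model_v3.2|" + "|".join(
    f"{row.parameter_set}={row.version_id}" for row in latest.itertuples()
)
print(run_identity)  # e.g. model_v3.2|price_deck=17|cost_curve=9|...
```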
## 7. Workflow Orchestration (Typical DAG)
- Extract latest approved parameters (SQL).
- Generate production forecasts (Spark decline expansion).
- Launch Monte Carlo batches (Spark jobs or HPC submissions).
- Aggregate distributions (Spark reduce / SQL summarize; a percentile aggregation sketch follows this list).
- Apply real options overlay (Python step).
- Persist summary percentiles (Hive + SQL Server).
- Notify stakeholders (email distribution).
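The distribution-aggregation and percentile-persistence steps might look like the following in PySpark 2.x; the table and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One row per (asset, path_id, npv) from the Monte Carlo batches.
paths = spark.table("sim.npv_paths")

# Headline percentiles per asset via the percentile_approx aggregate.
percentiles = paths.groupBy("asset").agg(
    F.expr("percentile_approx(npv, array(0.1, 0.5, 0.9))").alias("p10_p50_p90"),
    F.count("*").alias("n_paths"),
)

percentiles.write.mode("overwrite").saveAsTable("marts.npv_percentiles")
```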
## 8. Reporting & Analytics
- Executive BI: Spotfire / Tableau / early Power BI (NPV P10/P50/P90, tornado drivers).
- Ad hoc: Excel pivots via ODBC to curated SQL views.
- Storage optimization: Percentiles persisted; optional compressed binary full-path arrays (NFS) for audit replay.
- Delta comparisons: SQL view diffing the latest run against the prior approved baseline (a pandas equivalent is sketched below).
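A pandas equivalent of the delta view, assuming illustrative per-asset P50 extracts:

```python
import pandas as pd

# Latest run vs prior approved baseline, keyed by asset.
# File names and columns are illustrative.
latest = pd.read_csv("latest_run.csv")      # columns: asset, npv_p50
baseline = pd.read_csv("baseline_run.csv")  # columns: asset, npv_p50

delta = latest.merge(baseline, on="asset", suffixes=("_latest", "_baseline"))
delta["npv_p50_delta"] = delta["npv_p50_latest"] - delta["npv_p50_baseline"]
delta["pct_change"] = delta["npv_p50_delta"] / delta["npv_p50_baseline"]

print(delta.sort_values("npv_p50_delta"))
```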
## 9. Integration with Business Units
| Domain | Inbound Flow | Processing | Output Target |
|---|---|---|---|
| Geology | Petrel / Eclipse exports | ETL parse | Hive reservoir summaries |
| Drilling | Daily reports (PDF→manual XLS) | Structured capture | SQL + Hive staging |
| Production Ops | Hourly SCADA snapshot | Kafka / batch ingest | Event log / downtime model |
| Marketing | Monthly price deck Excel | Controlled ETL & approval | price_deck_version |
| Finance | ERP CAPEX/OPEX nightly | Sqoop import | Cost tables / allocation feeds |
| Risk | Correlation matrix | Import to risk_correlation module | Simulation inputs |
| Land / Infrastructure | Tariff & capacity schedules | Direct table loads | Constraint models |
## 10. Security & Access

- Hadoop: Kerberos (where cluster maturity allowed); otherwise OS-level ACLs.
- SQL Server roles: `FINANCE_READ`, `GEO_WRITE`, `ECON_ADMIN`.
- File share oversight: manual audits (Windows DFS / NFS export lists).
## 11. Governance & Data Quality

- Python DQ scripts logging exceptions to `dq_issue` (rule_id, entity, severity, timestamp); a minimal rule runner is sketched after this list.
- Monthly DQ review & remediation tracking (manual register).
- Lineage approximation: ETL job logs + timestamped folder patterns (no full automated lineage tooling yet).
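A minimal sketch of such a DQ rule runner against a pyodbc-style connection; the rules, tables, and columns are illustrative, not the production register:

```python
from datetime import datetime

# Each rule: (rule_id, entity, severity, SQL returning offending keys).
RULES = [
    ("R001", "well_header", "critical",
     "SELECT well_id FROM well_header WHERE spud_date > completion_date"),
    ("R002", "production_daily", "warning",
     "SELECT well_id FROM production_daily WHERE oil_rate < 0"),
]

def run_rules(conn):
    """Run each check and log one dq_issue row per offending entity key."""
    cur = conn.cursor()
    for rule_id, entity, severity, check_sql in RULES:
        for (key,) in cur.execute(check_sql).fetchall():
            cur.execute(
                "INSERT INTO dq_issue (rule_id, entity, severity, ts, entity_key) "
                "VALUES (?, ?, ?, ?, ?)",
                (rule_id, entity, severity, datetime.utcnow(), key),
            )
    conn.commit()
```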
## 12. Carbon / ESG (Early Stage)
- Emission factor reference table (fuel gas, flaring, venting).
- Static carbon price scenarios (low / base / high) feeding the OPEX line (a toy calculation follows this list).
- No dynamic abatement / marginal cost optimization at that stage.
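A toy version of the static carbon charge feeding the OPEX line; the factors and prices below are placeholders, not the actual reference table:

```python
# Placeholder emission factors (tCO2e per unit of activity) and
# scenario carbon prices ($/tCO2e) - both purely illustrative.
EMISSION_FACTORS = {"fuel_gas": 2.3, "flaring": 2.6, "venting": 8.4}
CARBON_PRICE = {"low": 10.0, "base": 25.0, "high": 50.0}

def carbon_opex(volumes, scenario="base"):
    """volumes: dict of activity -> annual volume; returns $ carbon cost."""
    tco2e = sum(EMISSION_FACTORS[k] * v for k, v in volumes.items())
    return tco2e * CARBON_PRICE[scenario]

cost = carbon_opex({"fuel_gas": 1_200.0, "flaring": 300.0}, scenario="high")
```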
## 13. Real Options Implementation Constraints
- Limited compute elasticity forced an overnight batch window; option runs were weekly rather than continuous.
- Focused scope: delay sanction, phased expansion, early abandonment (highest value density).
- Longstaff–Schwartz regression on a reduced path sample (≈2,000 paths) to conserve resources; a toy sketch follows this list.
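A compact Longstaff–Schwartz sketch on ~2,000 simulated GBM price paths, using a simplified value-iteration variant with a polynomial basis; all parameters are illustrative and the payoff is a stylized abandonment-style put:

```python
import numpy as np

def lsm_option_value(paths, payoff, r, dt, degree=2):
    """Longstaff-Schwartz on simulated price paths (simplified variant).
    paths: (n_paths, n_steps+1) array; payoff: exercise value fn of price.
    Returns the estimated option value at t=0."""
    n_paths, n_cols = paths.shape
    disc = np.exp(-r * dt)
    cash = payoff(paths[:, -1])            # exercise value at maturity
    for t in range(n_cols - 2, 0, -1):
        cash *= disc                       # roll value back one step
        exercise = payoff(paths[:, t])
        itm = exercise > 0                 # regress on in-the-money paths only
        if itm.sum() > degree + 1:
            coefs = np.polyfit(paths[itm, t], cash[itm], degree)
            continuation = np.polyval(coefs, paths[itm, t])
            ex_now = exercise[itm] > continuation
            idx = np.where(itm)[0][ex_now]
            cash[idx] = exercise[idx]      # exercise beats continuation
    return float(np.mean(cash * disc))     # discount final step to t=0

# Toy put on ~2k GBM price paths (parameters illustrative).
rng = np.random.RandomState(7)
n_paths, n_steps, dt, r, sigma, s0, strike = 2_000, 50, 0.02, 0.05, 0.35, 60.0, 55.0
z = rng.standard_normal((n_paths, n_steps))
log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
paths = np.hstack([np.full((n_paths, 1), s0), s0 * np.exp(log_paths)])
value = lsm_option_value(paths, lambda s: np.maximum(strike - s, 0.0), r, dt)
```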
## 14. Performance & Efficiency Tactics

- Pre-aggregate well groups to reduce simulation dimensionality.
- Spark broadcast for small dimension joins.
- Persist intermediate Monte Carlo artifacts (sampled price paths) for reuse.
- Convergence guardrail: escalate run count only when ENPV band instability exceeded tolerance (a guardrail sketch follows this list).
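A sketch of the guardrail loop, assuming a hypothetical `run_batch(n)` simulator hook; batch size and tolerance are illustrative:

```python
import numpy as np

def run_until_stable(run_batch, batch_size=5_000, tol=0.01, max_batches=8):
    """Escalate run count only while the ENPV P10-P90 band is still
    moving by more than `tol` (relative) between batches.
    run_batch(n) -> 1-D array of simulated NPVs (illustrative hook)."""
    samples = run_batch(batch_size)
    prev_band = None
    for _ in range(max_batches - 1):
        p10, p90 = np.percentile(samples, [10, 90])
        band = p90 - p10
        if prev_band is not None and abs(band - prev_band) / abs(prev_band) < tol:
            break                          # band stable: stop escalating
        prev_band = band
        samples = np.concatenate([samples, run_batch(batch_size)])
    return samples

# Example with a stand-in simulator (normal NPVs, purely illustrative).
rng = np.random.RandomState(11)
npvs = run_until_stable(lambda n: rng.normal(100.0, 30.0, size=n))
```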
## 15. Monitoring & Logging

- Scheduler logs (Oozie) plus `run_log` status, duration, and record counts.
- Simple email alerts on failure; an optional ELK stack where a PoC had been adopted, but not standardized.
## 16. Migration Path / Future Evolution
| Evolution Step | Rationale | Outcome Target |
|---|---|---|
| Broader Airflow adoption | Transparency & modular DAG mgmt | Replace opaque XML workflows |
| MLflow / experiment tracking | Trace model / parameter evolution | Reproducibility maturity |
| Object store full-path persistence | Richer post-hoc analytics | Scenario drill-down, re-simulation |
| Metadata / lineage catalog (DataHub etc.) | Governance & impact analysis | Faster change assessment |
| Containerization & elasticity | Reduce queue latency | On-demand scaling |
## 17. Architectural Summary Diagram

```mermaid
flowchart TD
    Sources["Sources"]
    Raw["Raw Zone (HDFS)"]
    Curated["Curated (Hive)"]
    Marts["Marts (Impala/Presto)"]
    Reservoir["Reservoir Grids"]
    Engines["Economic / Option Engines (Py/R)"]
    Outputs["Simulation Outputs (Hive + SQL)"]
    BI["Reporting / BI"]
    Sources --> Raw --> Curated --> Marts
    Raw --> Reservoir
    Curated --> Engines
    Engines --> Outputs --> BI
    Reservoir --> Outputs
```
## 18. Key Constraints vs Modern Design
| 2015–2018 Limitation | Modern Remedy |
|---|---|
| Limited elasticity (static cluster) | Serverless / autoscaling compute (Spark on k8s, Databricks) |
| Manual catalog & lineage | Active metadata graph (DataHub, OpenLineage) |
| Weekly options cycle | Event-driven incremental valuation |
| Percentiles-only storage | Columnar archival of full paths (Parquet / Delta) |
| Ad hoc approval workflow | Policy-driven parameter governance platform |
This is a historical architecture reference; it complements the upstream evaluation modeling framework post.