fantastic-spork is a Scala library for Apache Spark that provides additional high-performance SQL functions for common text and array operations. It is designed to boost productivity by extending Spark's built-in SQL and DataFrame capabilities.
In real-world analytics, Spark users often need to count substrings, tally words in collections, or otherwise process text, and these tasks are not always convenient to express with Spark's built-in SQL functions. fantastic-spork provides production-ready, native Catalyst expressions for these cases, giving Spark-native performance and seamless integration.
Artifacts are published on Maven Central.
libraryDependencies += "io.gitlab.rokorolev" % "fantastic-spork_2.12" % "0.1.2"
<dependency>
  <groupId>io.gitlab.rokorolev</groupId>
  <artifactId>fantastic-spork_2.12</artifactId>
  <version>0.1.2</version>
</dependency>
io.gitlab.rokorolev:fantastic-spork_2.12:0.1.2
import io.gitlab.rokorolev.fantasticspork.functions._
Available functions:
count_words_in_array()
: Returns a map of word counts from an array of strings.
count_substring()
: Counts occurrences (including overlapping) of a substring within a string.
Use these functions in your Spark SQL or DataFrame code.
// Spark SQL example
spark.sql("SELECT count_substring('abracadabra', 'abra')") // returns 2
// DataFrame example (expr comes from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions.expr
df.select(expr("count_words_in_array(array('apple', 'banana', 'apple'))")) // returns {apple -> 2, banana -> 1}
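A slightly fuller sketch of both functions applied over a DataFrame. The SparkSession named spark, the column names, and the sample data are illustrative assumptions; the expected results follow from the function descriptions above.
// Sketch only: assumes a live SparkSession `spark` and that the functions are
// usable in SQL expressions, as in the examples above.
import spark.implicits._

val df = Seq(
  ("abracadabra", Seq("apple", "banana", "apple"))
).toDF("text", "words")

df.selectExpr(
  "count_substring(text, 'abra') AS substring_hits", // expected: 2
  "count_words_in_array(words) AS word_counts"       // expected: {apple -> 2, banana -> 1}
).show(truncate = false)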
git checkout -b my-feature-branch
git commit -am 'Add new feature'
git push origin my-feature-branch
This project uses sbt for building and publishing. Artifacts go to Maven Central via Sonatype, using sbt-sonatype and sbt-pgp.
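For reference, these plugins are typically enabled in project/plugins.sbt along the following lines; the versions are placeholders, not pinned by this project (older sbt-pgp releases were published under the com.jsuereth organization instead):
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "<version>")
addSbtPlugin("com.github.sbt" % "sbt-pgp" % "<version>")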
sbt clean compile test
Before publishing, set up your PGP signing key and add your Sonatype credentials to ~/.sbt/1.0/sonatype.sbt:
credentials += Credentials(
  "Sonatype Nexus Repository Manager",
  "central.sonatype.com",
  "<your-username>",
  "<your-password>")
Release steps:
sbt +clean +test +publishSigned   # clean, test, and publish signed artifacts for every Scala version in crossScalaVersions
sbt sonatypeBundleRelease         # upload the staged bundle to Sonatype and release it to Maven Central
See sbt-sonatype docs and sbt-pgp docs for details.