fantastic-spork is a Scala library for Apache Spark that provides additional high-performance SQL functions for common text and array operations. It is designed to boost productivity by extending Spark's built-in SQL and DataFrame capabilities.
In real-world analytics, Spark users often need to count substrings, tally words in collections, or otherwise process text, and these tasks are not always convenient to express with Spark's built-in SQL functions. fantastic-spork provides production-ready, native Catalyst expressions for these cases, giving Spark-native performance and seamless integration.
Artifacts are published on Maven Central.
libraryDependencies += "io.gitlab.rokorolev" % "fantastic-spork_2.12" % "0.1.2"
<dependency>
  <groupId>io.gitlab.rokorolev</groupId>
  <artifactId>fantastic-spork_2.12</artifactId>
  <version>0.1.2</version>
</dependency>
io.gitlab.rokorolev:fantastic-spork_2.12:0.1.2
import io.gitlab.rokorolev.fantasticspork.functions._
Available functions:
count_words_in_array()
: Returns a map of word counts from an array of strings.
count_substring()
: Counts occurrences (including overlapping) of a substring within a string.
Use these functions in your Spark SQL or DataFrame code.
// Spark SQL example
spark.sql("SELECT count_substring('abracadabra', 'abra')") // returns 2
// DataFrame example (expr comes from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions.expr
df.select(expr("count_words_in_array(array('apple', 'banana', 'apple'))")) // returns {apple -> 2, banana -> 1}
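A slightly fuller sketch of both functions applied over a DataFrame. The SparkSession named spark, the column names, and the sample data are illustrative assumptions; the expected results follow from the function descriptions above.
// Sketch only: assumes a live SparkSession `spark` and that the functions are
// usable in SQL expressions, as in the examples above.
import spark.implicits._

val df = Seq(
  ("abracadabra", Seq("apple", "banana", "apple"))
).toDF("text", "words")

df.selectExpr(
  "count_substring(text, 'abra') AS substring_hits", // expected: 2
  "count_words_in_array(words) AS word_counts"       // expected: {apple -> 2, banana -> 1}
).show(truncate = false)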
git checkout -b my-feature-branch
git commit -am 'Add new feature'
git push origin my-feature-branch
This project uses sbt for building and publishing. Artifacts go to Maven Central via Sonatype, using sbt-sonatype and sbt-pgp.
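For reference, these plugins are typically enabled in project/plugins.sbt along the following lines; the versions are placeholders, not pinned by this project (older sbt-pgp releases were published under the com.jsuereth organization instead):
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "<version>")
addSbtPlugin("com.github.sbt" % "sbt-pgp" % "<version>")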
sbt clean compile test
Before publishing, set up your PGP signing key and add your Sonatype credentials to ~/.sbt/1.0/sonatype.sbt:
credentials += Credentials(
  "Sonatype Nexus Repository Manager",
  "central.sonatype.com",
  "<your-username>",
  "<your-password>")
Release steps:
sbt +clean +test +publishSigned   # clean, test, and publish signed artifacts for every Scala version in crossScalaVersions
sbt sonatypeBundleRelease         # upload the staged bundle to Sonatype and release it to Maven Central
See sbt-sonatype docs and sbt-pgp docs for details.