The failure of Zenobia Pay encodes much of the REAL value of Mastercard and Visa:
1. A global, enforceable rulebook + dispute court.
They standardize how authorization, clearing, settlement, chargebacks, retrievals, representments, reason codes, and evidence work—and they arbitrate when parties fight. That governance is why a corner bodega and a transnational airline can both accept the same credential. (Read the rulebooks; they’re huge, living specs.)
2. Credible liability commitments that change customer behavior.
Zero-liability and liability-shift regimes make consumers fearless and pressure merchants to adopt secure tech (EMV, 3-D Secure). Fearless buyers = higher conversion. That demand-side boost is the engine of card commerce.
3. Tokenized, portable identity for payments.
Network tokens (EMVCo) and wallet provisioning (Apple Pay/Google Pay via DPANs) are the reason card data can live safely in phones, browsers, and vaults. This reduces breach externalities and keeps the credential working when plastic changes. That’s not ACH.
4. Compliance offload and ecosystem discipline.
PCI exists so the brands don’t directly police every merchant’s infosec day-to-day—yet they still set the bar and yank privileges when needed. It’s governance as a service.
5. Programmable payout rails on the same credential.
They’re not just purchase networks anymore. Push-to-card (Visa Direct/Mastercard Send) rides the acceptance footprint for disbursements, wage advances, gig payouts, and remittances—instantly, to billions of cards. That makes the card a universal endpoint for money-in and money-out.
6. Regulatory navigation and durability.
Interchange caps and business-rule constraints (EU IFR; U.S. Durbin/Reg II) didn’t kill them; they adapted by shifting economics across scheme fees, value-added services, and routing. Survivability under hostile policy is part of the value.
7. They own the “choice architecture.”
Historically, anti-steering rules protected fee levels; those were curtailed, but the lesson stands: control over how credentials are presented and preferred at checkout is leverage. (See the AmEx case for the legal theory on two-sided markets and steering.)
What they don’t do (important)
• They don’t issue credit or carry most fraud losses—that’s issuers. Networks set rules and move bits; issuers/acquirers take primary financial exposure and then sling chargebacks through the network’s process. (Still: the rules are the value.)
• They aren’t the only rails that can scale: account-to-account can win when the state or banks coordinate (Pix, UPI, iDEAL). Those systems prove rails alone can beat cards on price and UX—if you also deliver governance and adoption.
Where they’re vulnerable next (and already hedging)
• A2A/instant schemes (Pix, UPI, iDEAL) are re-wiring consumer habits. If U.S. open banking + FedNow/RTP ever gets real UX and liability parity, cards will feel it. Meanwhile, Visa/MC are buying into open banking to stay the orchestration layer (Visa–Tink; Mastercard–Finicity).
• Checkout is being intermediated by wallets and platforms. Apple/Google own the front door; card brands keep the credential alive via network tokens, but UX power is shifting up-stack. Tokenization keeps them relevant; control of the UI does not necessarily stay with them.
• Policy pressure keeps grinding down interchange/steering constraints. They can adapt, but the rent skim is under scrutiny—again.
The blunt summary
Visa and Mastercard don’t win because they’re the fastest rail or the cheapest. They win because they govern trust at scale: a portable identity (token), a standardized contract (rules), and a credible promise about who pays when things go wrong (liability). That cocktail reliably boosts conversion for merchants and confidence for consumers. Until an alternative can match all four—rail + rules + identity + liability—cards remain the default operating system for commerce.
AI has sparked new interest in high-dimensional embeddings for approximate nearest neighbor search. Here is a highly scalable implementation of a companion technique, k-means clustering, written in Scala for Spark 1.1.
Please let me know if you fork this library and update it to later versions of Spark.
ChatGPT
This is a result from the "dig" command, a DNS (Domain Name System) tool used to query DNS servers and retrieve information about domain names. Let's break down the output:
dig: This is the command being executed.
+human: This flag is not a standard dig option. It is likely an extension that is meant to provide a more human-readable output, making it easier for people to understand the results.
+norecurse: This flag tells the DNS server (in this case, the server at IP address 8.8.8.8, which is a public Google DNS server) not to perform recursive lookups. Recursive lookups involve the DNS server querying other DNS servers if it doesn't have the requested information in its cache. By using +norecurse, the server will only provide information if it has it directly available.
@8.8.8.8: This indicates the DNS server being queried, which is the Google Public DNS server located at IP address 8.8.8.8.
google.com: This is the domain name being queried.
Now, let's look at the different sections of the response:
HEADER: This section provides general information about the response.
opcode: QUERY: Indicates that the query type is a standard query.
status: NOERROR: Indicates that the query was successful, and there were no errors.
id: 11653: This is a unique identifier for the query, used to match queries with their respective responses.
flags: qr ra: The flags indicate that this is a response (qr) and that the server supports recursive lookups (ra, recursion available).
QUESTION SECTION: This section shows the domain name and the type of record being queried.
google.com. IN A: The domain name "google.com" is being queried for its IPv4 address (A record).
ANSWER SECTION: This section provides the answer to the query.
google.com. 21 IN A 172.217.4.206: This line shows the answer. It says that the IPv4 address of "google.com" is 172.217.4.206. The "21" is the time to live (TTL) value in seconds, which indicates how long the information can be cached by other DNS servers or clients before it should be refreshed.
ADDITIONAL SECTION: This section provides additional information related to the query.
EDNS: version: 0, flags:; udp: 512: This line indicates the use of EDNS (Extension Mechanisms for DNS), which is a set of extensions to the DNS protocol. The version is 0, and the flags are empty. "udp: 512" refers to the maximum payload size that can be used for UDP communication.
In summary, the query asked for the IPv4 address (A record) of "google.com" from the Google Public DNS server at IP address 8.8.8.8. The response confirms that the IPv4 address is 172.217.4.206, and it is valid for 21 seconds.
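As a sketch of where those header fields live on the wire, the 12-byte DNS header can be packed and unpacked with Python's standard library. The byte values below mirror the response discussed above (id 11653, qr and ra flags set, NOERROR); this is an illustration of the RFC 1035 header layout, not output captured from dig:

```python
import struct

# Build a sample 12-byte DNS response header: id=11653, flags with qr and ra
# set, 1 question, 1 answer, 0 authority, 0 additional records.
header = struct.pack(">HHHHHH", 11653, 0x8080, 1, 1, 0, 0)

# Unpack it the way a resolver would.
msg_id, flags, qdcount, ancount, nscount, arcount = struct.unpack(">HHHHHH", header)
qr = bool(flags & 0x8000)      # bit 15: this is a response
opcode = (flags >> 11) & 0xF   # bits 11-14: 0 = standard QUERY
ra = bool(flags & 0x0080)      # bit 7: recursion available
rcode = flags & 0xF            # bits 0-3: 0 = NOERROR

print(msg_id, qr, opcode, ra, rcode)  # 11653 True 0 True 0
```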
Two softball players. One hits the ball an average of 230 feet on 40% of at-bats. The other hits an average of 210 feet on 40% of at-bats. The home-run wall is at 220 feet.
One is a GREAT home-run hitter. The other has a poor batting average.
The issue is that the success measure is non-linear.
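The non-linearity is easy to see in code. A minimal sketch, simplifying by assuming every hit travels the player's average distance:

```python
WALL_FT = 220  # distance to the home-run wall

def homerun_rate(avg_hit_ft, contact_rate):
    # Step function: a hit either clears the wall or it doesn't.
    # A 20-foot difference in average distance flips the outcome entirely.
    return contact_rate if avg_hit_ft > WALL_FT else 0.0

print(homerun_rate(230, 0.40))  # 230-ft hitter: home run on 40% of at-bats
print(homerun_rate(210, 0.40))  # 210-ft hitter: 0% home runs
```

The success measure is a threshold, so a small linear difference in input produces an all-or-nothing difference in output.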
Yup. Also QNX native networking protocol. QNX's basic networking primitive is a remote procedure call. So there's a message-oriented network protocol underneath. It can be run either on top of UDP or directly at the IP level.
## Title
Generalized K-Means Clustering for Apache Spark with Bregman Divergences
## Body (3,982 characters)
I've built a production-ready K-Means library for Apache Spark that supports multiple distance functions beyond Euclidean.
*Why use this instead of Spark MLlib?*
MLlib's KMeans is hard-coded to Euclidean distance, which is mathematically wrong for many data types:
- *Probability distributions* (topic models, histograms): KL divergence is the natural metric. Euclidean treats [0.5, 0.3, 0.2] and [0.49, 0.31, 0.2] as similar even though they represent different distributions.
- *Audio/spectral data*: Itakura-Saito respects multiplicative power spectra. Euclidean incorrectly treats -20dB and -10dB as closer than -10dB and 0dB.
- *Count data* (traffic, sales): Generalized-I divergence for Poisson-distributed data.
- *Outlier robustness*: L1/Manhattan gives median-based clustering vs mean-based (L2).
Using the wrong divergence yields mathematically valid but semantically meaningless clusters.
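For concreteness, the two measures can be compared directly on probability vectors (plain Python for illustration, not the library's API). KL penalizes probability mass moving toward zero far more heavily than Euclidean distance does:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) for discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sq_euclidean(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.49, 0.31, 0.2]    # close under both measures
r = [0.5, 0.499, 0.001]  # third component nearly vanishes

# Squared Euclidean sees p vs r as a modest change (~0.08);
# KL sees a large one (~0.9), because mass near zero matters.
print(sq_euclidean(p, q), kl(p, q))
print(sq_euclidean(p, r), kl(p, r))
```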
*Available divergences:* KL, Itakura-Saito, L1/Manhattan, Generalized-I, Logistic Loss, Squared Euclidean
*What's included:*
- 6 algorithms: GeneralizedKMeans, BisectingKMeans, XMeans (auto k), SoftKMeans (fuzzy), StreamingKMeans, KMedoids
- Drop-in MLlib replacement (same DataFrame API)
- 740 tests, deterministic behavior, cross-version persistence (Spark 3.4↔3.5, Scala 2.12↔2.13)
- Automatic optimization (broadcast vs crossJoin based on k×dim to avoid OOM)
- Python and Scala APIs
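The broadcast-vs-crossJoin choice can be sketched as a simple size heuristic. This is illustrative only — the function name and the 64 MB threshold are my assumptions, not the library's actual code:

```python
# Hypothetical sketch: broadcast the k x dim center matrix to every executor
# when it is small enough; otherwise fall back to a crossJoin-based
# assignment so no single task has to hold all centers in memory.
def choose_strategy(k, dim, bytes_per_double=8, broadcast_limit=64 * 1024 * 1024):
    center_bytes = k * dim * bytes_per_double
    return "broadcast" if center_bytes <= broadcast_limit else "crossJoin"

print(choose_strategy(10, 100))          # tiny center matrix -> broadcast
print(choose_strategy(10_000, 100_000))  # ~8 GB of centers -> crossJoin
```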
*Example:*
```scala
// Clustering topic distributions from LDA
val topics: DataFrame = // probability vectors

// WRONG: MLlib with Euclidean
new org.apache.spark.ml.clustering.KMeans()
  .setK(10)
  .fit(topics)

// CORRECT: KL divergence for probabilities
new GeneralizedKMeans()
  .setK(10)
  .setDivergence("kl")
  .fit(topics)

// For standard data, drop-in replacement:
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .fit(numericData)
```
*Quick comparison:*
| Use Case | MLlib | This Library |
|----------|-------|--------------|
| General numeric | L2 | L2 (compatible) |
| Probability distributions | Wrong | KL divergence |
| Outlier-robust | | L1 or KMedoids |
| Auto k selection | | XMeans (BIC/AIC) |
| Fuzzy clustering | | SoftKMeans |
*Performance:* ~870 pts/sec (SE), ~3,400 pts/sec (KL) on modest hardware. Scales to billions of points with automatic strategy selection.
*Production-ready:*
- Cross-version model persistence
- Scalability guardrails (chunked assignment)
- Determinism tests (same seed → identical results)
- Performance regression detection
- Executable documentation
GitHub: https://github.com/derrickburns/generalized-kmeans-clusterin...
This started as an experiment to understand Bregman divergences. Surprisingly, KL divergence is often faster than Euclidean for probability data. Open to feedback!