Java / JVM development trends for data projects


If Arrow and cuDF remain relevant over the next few years, something similar will happen for those as well

This seems very interesting to me because it makes the polyglot ecosystem more real.
And the JVM side needs this to get closer to the action in the big data products world.

@wesm please advise on your take on this strategy for #ApacheArrow.


As far as investment in open source outside the JVM goes, I’ve seen rumblings that Databricks is rewriting some of their engine in C++ to do optimized SQL things. I’m not sure what their reasoning was for building this outside the JVM… guessing maybe they couldn’t get the speed they needed from the JVM? The only thing I’ve seen on Photon is this job posting though:

Sorry I skipped the point I was trying to make here.

Let’s say there is a measurable trend of JVM-based data OSS vs. non-JVM-based, and that trend is measured by some holistic view of public repositories. If companies the size of Databricks are actively developing off-JVM in private, the insights into that trend are going to seriously lag behind. We’ll look back on the trend and say “oh yeah, it’s obvious now that the giant shift was taking place all along behind the scenes”. But what you’ll actually be measuring is adoption of these projects once they are released publicly, and by the time you notice a trend you’ll be behind the innovation wave.

@nicpk we’ve been doing this for years now. There have been a ton of initiatives to do a lot of performance-sensitive math work off heap. Look no further than what @wesm and co are doing with Gandiva as well.

When I built the first version of our “numpy for Java” back in 2014, it was after trying to use every single existing matrix library, from netlib-java (which Spark still relies on for some reason, despite being unmaintained at this point) to jblas (also mostly unmaintained), trying to keep up with the performance of C. It just wasn’t working. Both libs had the same issues: on-heap memory, and trying to use Java for something that it isn’t meant for at a certain scale.

The JVM has some great integrations with the “big data” ecosystem thanks to Netty and co doing native performance optimization behind the scenes. None of these trends are really new. For anything that matters, you generally still use some form of JNI.
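To make the off-heap point concrete, here is a minimal sketch (not code from any of the libraries mentioned; the class name is made up) of the basic technique: keep numeric data outside the garbage-collected heap in a direct ByteBuffer, which is exactly the kind of memory a JNI/JavaCPP-wrapped BLAS routine can read without a copy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of "off heap" numeric storage: a direct ByteBuffer lives
// outside the GC-managed heap, so native code can access it without a copy
// and the GC never relocates it mid-call.
public class OffHeapVector {
    private final ByteBuffer buf;
    private final int length;

    public OffHeapVector(int length) {
        this.length = length;
        // allocateDirect reserves native memory; nativeOrder matches what C sees
        this.buf = ByteBuffer.allocateDirect(length * Double.BYTES)
                             .order(ByteOrder.nativeOrder());
    }

    public void set(int i, double v) { buf.putDouble(i * Double.BYTES, v); }
    public double get(int i)         { return buf.getDouble(i * Double.BYTES); }

    // A JNI/JavaCPP binding would hand the buffer's native address straight
    // to a C BLAS routine; here we just sum on the JVM side for illustration.
    public double sum() {
        double s = 0;
        for (int i = 0; i < length; i++) s += get(i);
        return s;
    }
}
```

The trade-off is manual lifetime management: nothing frees this memory until the buffer is collected, which is why libraries like ND4J and Netty add their own reference-counted allocators on top.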


Indeed, quite interesting, glad you think so (too).
In this context, and the context of my post above on “Scala/Java as DSLs”, what do you think of “Samsara: Declarative #MachineLearning on Distributed Dataflow Systems”?

Would it not be great to use a/this DSL for (distributed) linear algebra?

The Mahout guys are using Samsara. Netlib and its ilk aren’t distributed.

It’s just an interface to BLAS. BLAS itself barely implements anything beyond matrix multiply and some other key operations you use as building blocks. The rest is still an exercise left to the reader: if you want most of a “numpy”, you need a lot more. That’s partially why we wrote ND4J.

The ideas are good, but it’s restricted to matrices and lacks ops and autodiff, as well as interop with the most popular libraries people use (ONNX, TF at a minimum).

I remember when those guys were starting to work on Samsara a few years ago. It was a great idea, and I had talked with them about using it for deep learning, but we’ve mostly needed data parallelism in practice. Model parallelism (a matrix across devices) can be built on top of that, which is interesting. Beyond that, it seems like things fell behind a bit.

For pure-play ML, honestly, Samsara and Mahout are great. As for Samsara/Spark, I’m guessing it’s just a “we already wrote this”, and it’s not worth it for either side to do the work of integrating more formally with the other.

I would like to see some potential benchmarks now, as well as where GPGPU (CUDA mainly) is. Based on an initial read, it seems like most things are built in Java. As discussed in this thread, C++ for the math is a hard requirement to get any perf competitive with Python nowadays.

Personally, the DL4J team is interested in solving this problem with Flink/Spark etc. Both have k8s integration now, and k8s has proper GPU support now. Problem is, if you look at the collective effort to migrate/update the matrix math libraries to something maintained, most efforts have failed:

We have some interest in properly supporting these things. It’s mostly a resource/funding/priority thing for us. I’m sure it’s the same for apache samsara. I will not speak for them of course though :slight_smile: I’m hoping this forum will attract an open discussion.
We’re slowly starting a discussion on how to standardize matrices on the jvm:!forum/java-ndarray-standardization

TensorFlow SIG-JVM, AWS DJL/Apache MXNet, and DL4J are all trying to see if we can start building momentum around a shared standard.

So to summarize: “matrix math on the JVM” is moving, and it has a C++ backend, Python interop, off-heap memory, autodiff, and BLAS operations.

Apache Arrow’s tensor data type is also really exciting. I would be curious to see if a core multilingual framework can be built on top of that, maybe with a pluggable compute backend?

@apalumbo you would know better than me I believe. I’d love to hear an update on where things are.


Very good to know.
Please share lots more resources / evidence for our education, @agibsonccc, @apalumbo

BTW: do you mean matrix math/algebra or full tensor algebra ?

hey all, I apologize for the late response, I’ve been under the weather and back and forth to the doctor a few times.

@apalumbo you would know better than me I believe. I’d love to hear an update on where things are.

Yes, I’ll give a quick overview of Mahout’s layers of abstraction. The abstract DistributedRowMatrix.scala, DrmLike[K] [1], and the in-core matrices [11] make it easy to test different memory and operator types for sparse and/or dense matrix algebra. We currently have operable back-ends for Spark [3] (for production), H2O (batch only; passes tests; not currently maintained, original bindings built in 2015), and Flink. We have also tested GPU backing of the in-core matrices on all three architectures.

BTW: do you mean matrix math/algebra or full tensor algebra ?

@SemanticBeeng At the moment I am speaking about 2d matrix algebra, but I believe that Mahout is flexible enough to be extended to full tensor algebra.

For starters, we have a Distributed Row Matrix: a blockified distributed matrix with row key type K, DrmLike[K]. It is abstract [1] and extended for each engine, in this case Spark [2]:

RDD[(K, InCore)]
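To illustrate the blockified structure, here is a toy local sketch (the names are made up and a plain List stands in for the RDD): each partition pairs the row keys with a dense in-core block, and the engine routes row lookups by key.

```java
import java.util.List;

// Toy stand-in for a blockified DRM: each partition holds row keys plus a
// plain dense in-core block. A real DRM would be an RDD[(K, Matrix)]; a
// local List plays that role here for illustration only.
public class DrmSketch {
    // One "partition": keys for the rows of the block, plus the block itself.
    public record Block(int[] rowKeys, double[][] inCore) {}

    // Row lookup across blocks, the way an engine would route by key.
    public static double[] rowFor(List<Block> drm, int key) {
        for (Block b : drm) {
            for (int i = 0; i < b.rowKeys().length; i++) {
                if (b.rowKeys()[i] == key) return b.inCore()[i];
            }
        }
        throw new IllegalArgumentException("no such row: " + key);
    }
}
```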

Problem is, if you look at the collective effort to migrate/update the matrix math libraries to something maintained, most efforts have failed

Same with Mahout: we have working bindings, just not production grade. See below for explanation:

  * <-- biggest problem 
  * Several more Filtered:

We actually have bindings for Flink which met our standard: in this case DrmLike[K] was backed by a Flink batch Dataset [2]:

where the matrix was distributed as Dataset[(K, InCore)]

For H2O we had a mixed Java/Scala module [4][5][6]:

using Chunks of an H2O Frame, with fields for the key K and the in-core matrices:

 Frame frame = drm.frame;
 Vec labels = drm.keys;

Spark is the only production grade backend, and that was kind of an evolutionary decision. We set up the bindings for H2O, and they fell to the wayside as H2O used their own methods to ride on Spark. This was planned, and determined by users’ interest and willingness to contribute.

As for Flink, several of us from Apache Mahout, Apache Flink and TU Berlin, plus a full-time grad student from TU Berlin, did get an MVP after a good amount of time. Our criterion for accepting a new engine backend was for it to pass all of our abstract math-scala tests [7].

After a good amount of time, both teams acknowledged that the Flink distribution was lacking in-memory caching capabilities and did not have memory checkpointing at the time (this was Flink Batch, ~1.1.4; they may have since fixed this issue). The lack of caching and checkpointing (off the top of my head) would not allow for our “Distributed Stochastic SVD” (DSSVD), even in unit tests using very small matrices to start. We discussed with the heads of the project that, unfortunately, until a cache and checkpointing were available, a declarative DSL for distributed BLAS would not work with Mahout. The biggest issue was that with Flink, each expression in the logical graph was eagerly executed, which caused huge memory issues for any highly iterative algorithm or operation. At the time I was looking into implementing this with a few different pieces of caching software, but just didn’t have the time. As well, this was Spring of 2016, and I have no idea how far the batch side of their project has come; this may have been fixed already.

We then implemented OpenCL bindings [8] and OpenMP bindings [9] by extending the in-core matrices to offload CSC matrices onto the GPU (or off-heap memory in the case of OpenMP) via the ViennaCL [6] library; we again essentially put out an MVP. We used JavaCPP to wrap the dgemm and sgemm routines for BLAS III and BLAS II, though only DrmLike[K] %*% DrmLike[K] or DrmLike[K] %*% Vector. We did not allow for scalars, as we were simply overriding the memory and compute placement of the engine-agnostic in-core matrices. (@agibsonccc, @saudet, this was around the time that we were chatting with you guys a lot.) We couldn’t get the parser going with the ViennaCL library as it was header-only C++ (which, as I remember, we could have), and we couldn’t use ND4J for the back end due to licensing issues IIRC. We started with BLAS as it was important for DSSVD, which we were showcasing quite often. The plan was to go on to more primitives and more on-GPU or off-heap OpenMP/MKL matrix operations.

Though we have a more elegant plan, currently we decide dynamically, upon triggering the execution of a physical plan containing distributed matrix multiplication (%*%):

DrmLike[K] %*% ...

Using the Samsara DSL, the back-end engine will, for each backend machine, determine where to offload each in-core matrix’s backing data: e.g. main memory for OpenMP-backed sparse or dense matrix multiplication, or GPU memory (tested on NVIDIA, with approximately 30x speedups seen, net of copy time). As currently implemented, we simply check for the existence of the class: e.g. viennacl.GPUMMul must be on the backend CLASSPATH so that we may use it when executing the plan on any given engine.

 **pseudocode (each backend)

       opA == %*%, (operands: A: DrmLike[K], B: DrmLike[K]) => match {

          case GPUMMul on classpath => drmC =
                        org.apache.mahout.viennacl.GPUMMul(A, B)
          case OMPMMul on classpath => drmC =
                        org.apache.mahout.viennacl.OMPMMul(A, B)  // off heap with OMP support

          case _ => org.apache.mahout.scalabindings.RLikeDrmOps.mmul(A, B)  // JVM
       }

PLEASE NOTE: this is not the preferred method of using JavaCPP. We did not run the full parser on the ViennaCL library; most of it was run class by class, and we converted only what we needed, based off of JavaCPP’s methodology. Please read the JavaCPP presets closely when using JavaCPP; this is not a good example.

I.e. if mahout.viennacl.GPUMMul has been added to the backend classpath on the executing node for the given engine (precompiled for the targeted host and routed to the pertinent native device functions and helper functions), the overridden MMul class is used to determine the sparsity of each DRM’s in-core blocks, using a preset %-density (0.002% is the default IIRC) to a given degree of confidence, and precompiled C++ code is executed on the specified device(s). If not, we check for mahout.viennacl-openmp: if OMPMMul is on the classpath, memory is offloaded off heap to main memory, and the plan is executed using prebuilt C++ binaries as specified in the OpenMP properties, run natively in ViennaCL BLAS III. The result, a pointer to drmC, is passed back to host memory, to either be copied back into an in-core matrix and returned to the resulting DrmLike, or to remain in place in memory to be used in a further operation.
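The classpath-probing dispatch described above can be sketched in plain Java with Class.forName. The backend class names below are hypothetical, not Mahout’s real ones, and the JVM fallback is a naive triple loop rather than Mahout’s actual MMul; it is only meant to show the probe-then-fall-back shape.

```java
// Hedged sketch of classpath-based backend dispatch: probe for an optional
// native-backed implementation and fall back to a pure-JVM multiply.
public class MMulDispatch {
    interface MMul { double[][] apply(double[][] a, double[][] b); }

    // Pure-JVM fallback: naive triple loop, correct but slow.
    static final MMul JVM_MMUL = (a, b) -> {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    };

    // Probe the classpath the way the pseudocode above checks for GPUMMul.
    static MMul pick() {
        for (String cls : new String[]{
                "org.example.viennacl.GPUMMul",    // hypothetical GPU backend
                "org.example.viennacl.OMPMMul"}) { // hypothetical OpenMP backend
            try {
                return (MMul) Class.forName(cls)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException ignored) {
                // not on the classpath; try the next backend
            }
        }
        return JVM_MMUL; // nothing native found: JVM fallback
    }
}
```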

We have a plan for something much more elegant. There are further optimizations from there.

One of the next steps, as well, was to keep a pointer to inCoreC, where inCoreC := inCoreA %*% inCoreB, in GPU memory, avoiding extra copies to and from host, and to evaluate the entire expression in (multi-)GPU memory. However, if working with Apache Arrow this may be even easier; as well, CUDA unified memory should be of use here.

I will give a better rundown in the next days. When an expression is passed to each engine, the DSL has a logical optimizer which will recursively dive into the logical plan and rewrite it in a more logically optimal way, in order to e.g. remove redundant operations or rewrite anything that causes a memory shuffle or some other I/O-intensive case. The most common and intuitive case is the rewrite of

 val drmX = drmA.t %*% drmA

Rather than actually going through the steps of transposing a fully distributed matrix and then multiplying it by a copy of itself, Mahout computes ATA via a row-outer-product formulation.
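A small self-contained Java sketch of that rewrite (not Mahout code; plain arrays stand in for DRM blocks), showing that summing the outer product of each row with itself gives the same A^T A as the naive transpose-then-multiply, while each row can be processed independently by a worker:

```java
// A^T A two ways: naive transpose-then-multiply vs. the row-outer-product
// formulation a distributed engine prefers (no materialized transpose).
public class AtA {
    public static double[][] naive(double[][] a) {
        int n = a.length, m = a[0].length;
        double[][] at = new double[m][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) at[j][i] = a[i][j];
        double[][] c = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < m; j++)
                    c[i][j] += at[i][k] * a[k][j];
        return c;
    }

    // Row-outer-product formulation: A^T A = sum over rows r of (r outer r).
    public static double[][] rowOuter(double[][] a) {
        int m = a[0].length;
        double[][] c = new double[m][m];
        for (double[] row : a)                  // each row independently (map side)
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++)
                    c[i][j] += row[i] * row[j]; // partial sums combined (reduce side)
        return c;
    }
}
```

The m x m result is small relative to the n x m input, so only the partial sums cross the network, never the transpose.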

These are not the only features of Mahout. The DSL has a full scikit-learn style transform/fit/predict pipeline, which makes fast prototyping of high-level distributed algorithms simple on any engine and device combo implemented. All it takes to implement a new engine is to extend the math-scala library, reproducing it in a new module for a new engine and/or a new device. As well, the back-end closures, on Spark at least, are open to native applications, and Python applications can run from the closures.

So to summarize: “matrix math on the JVM” is moving, and it has a C++ backend, Python interop, off-heap memory, autodiff, and BLAS operations.

I agree with this. I’ve actually been working on something similar, for which I was eventually planning to optimize matrices in Python [10] (WIP) to look at Python memory interaction; I had to put it aside before even getting a good build, but I later came to find out that it was due to some firmware issues with the first Jetson Nano. This is really just a cheap pipe from Python -> Java at first; eventually it will form a queue of in-core Mahout CUDA/GPU-backed matrices, or a stream, but I may change it to a protobuf/gRPC (Arrow Flight?) connection if time permits. At first it will be using both NiFi and MiNiFi (Java) and the RAPIDS AI stack to process SDR I/Q streams. I’ve been planning on comparing several different back-ends to the Mahout in-core interface [11] and presenting in a notebook. I think that there are some easy wins. At the moment I don’t have Mahout in this project yet, but I am using it as a method to start thinking about streaming and more optimizations.

As well, we do have a ticket for, and plan on, looking at Arrow as a back end [12].

All this to say that Mahout is easily extensible: at the matrix memory level, at the engine level, and even at the serialization level.

The Mahout stack is abstracted like so:

with the highest level being a Scala DSL with a fit/predict framework and several R-like functions, with which a distributed thin-QR decomposition may be implemented in a few lines:

Or see DSSVD.
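Since the snippet itself isn’t shown here, one common way such a distributed thin QR works is Cholesky QR: R = chol(A^T A) and Q = A R^{-1}, so only the small m x m Gram matrix has to be aggregated across workers. The following is a hedged single-node Java sketch of that formulation, not Mahout’s actual implementation (it assumes a tall matrix with full column rank).

```java
// Single-node sketch of Cholesky QR: R = chol(A^T A), Q = A R^{-1}.
// Distributed versions aggregate only the m x m Gram matrix across workers.
public class CholeskyQR {
    // Returns {Q, R} for a tall matrix A (n >= m, full column rank assumed).
    public static double[][][] thinQR(double[][] a) {
        int n = a.length, m = a[0].length;
        // Gram matrix G = A^T A, via row outer products
        double[][] g = new double[m][m];
        for (double[] row : a)
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++) g[i][j] += row[i] * row[j];
        // Cholesky: G = R^T R, with R upper triangular
        double[][] r = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = i; j < m; j++) {
                double s = g[i][j];
                for (int k = 0; k < i; k++) s -= r[k][i] * r[k][j];
                r[i][j] = (i == j) ? Math.sqrt(s) : s / r[i][i];
            }
        }
        // Q = A R^{-1}, row by row via forward substitution (rows independent,
        // so each worker can compute Q for its own blocks)
        double[][] q = new double[n][m];
        for (int t = 0; t < n; t++) {
            for (int i = 0; i < m; i++) {
                double s = a[t][i];
                for (int k = 0; k < i; k++) s -= q[t][k] * r[k][i];
                q[t][i] = s / r[i][i];
            }
        }
        return new double[][][]{q, r};
    }
}
```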

At the lowest level of implementation, the in-core matrix and matrix operations can be written in any language and run on any device, as long as they pass the tests in the highest level’s math-scala module. See [13] for another example of an algorithm written in Mahout’s top level, which will run on any engine (if well implemented) and with any device/off-heap operation (if the in-core matrix and all necessary operations are correctly extended).

The mahout guys are using samsara. Netlib and its ilk aren’t distributed.

IIRC, we couldn’t use Netlib/ND4J in the past because of the Apache license. But if it is possible to back Mahout AbstractMatrices with ND4J arrays, or whichever classes make the most sense, we could always use some scripts to build it in after downloading. I believe this has been done for MKL in several tools which could not distribute the code due to license. Do you think that ND4J matrices could work as in-core matrices?

@wesm do you have any thoughts on Arrow, and whether the column-wise Tensor structure’s performance would be hurt too significantly by using a horizontally blockified out-of-core Distributed Row Matrix (which occasionally must iterate over in-core rows)? I suppose we could always use the transpose.

[9] [WIP]
[10] [WIP]

Please note: I’m pointing at the 0.13.0 branch of Mahout; we’re currently working on a 14.1 release to better organize. I may be AFK for a couple of days, so please pardon any mistakes above; I will clarify when I can. As well, we have a technical doc for adding new engines and new back ends to in-core matrices that I’d like to dig up. Thanks, and I hope that this makes sense. I’d meant for this to be short, but it got a bit long… apologies, because I still have some ideas…

As well, the reason that we are still in this state is a different story, mainly due to the fact that we are an all-volunteer project: several of us have had shifting IP restrictions at work, I myself have had health problems, the usual all-volunteer open source issues. We are a cohesive community; we keep the project up to date and are in good health. I can explain more down the line; there are other optimizations that I would like to discuss.

Apologies for all of the edits; I wanted to get this out quickly. I hope there is not too much rambling, misspelling, or redundancy.


Congratulations on your funding!

I don’t quite follow. We have defined some metadata structures for dense and sparse tensors/matrices so that Arrow’s serialization and zero-copy facilities can be utilized for working with tensor data falling outside of the columnar format.

For those not familiar with the columnar format details see and especially my VLDB workshop slides from this past summer

We can spin this out into a separate discussion if you like.


Ok yes, this actually answers my question.

I will read these closer.


@wesm I would be interested in this as well. We ended up implementing our own auto-generated bindings for Arrow and want to use the Arrow tensor abstraction as a first-class data type if possible. We’re building data pipelines on top of Arrow. I would like to see how we can flesh these things out a bit. We’re still orienting our roadmap around some of these things, but would be interested in at least keeping up with developments for the data type in general.


I’d be happy to discuss the details further on

Yes, you should be able to. I would re-evaluate that; maybe just verify what caused the issue. We’re part of the Eclipse Foundation now and went through their IP process, which is similar to Apache’s.
Maybe taking this offline, file an issue over at ?

Very interested - where is (will be) that going on @agibsonccc, @wesm ?

@SemanticBeeng basically we have ETL pipelines that will interop with our compute library.

The idea will be similar to what the arrow team is already doing with some of their compute kernels.
Arrow will be a viable input into our TensorFlow-like framework, SameDiff. It’s still WIP right now.
Think of it as similar to what will likely do with arrow at some point.

We’re taking advantage of the fact that we have an Arrow binary pointer with JavaCPP built in. Because of JavaCPP, we’re able to directly interop with all the different libraries in a zero-copy-like fashion in memory.

We also have a built-in JavaCPP-based CPython distribution which allows us to leverage Python
execution + pyarrow to just pass around pointers. I know Arrow is zero copy for free of course (since it’s a self-describing format). For deployment purposes, we’re finding it useful for different use cases.
What we’re hoping to do with this is use arrow’s footprint in other languages as a way of sending us binary data kind of like you would use it for spark today.
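A toy Java illustration of the pointer-passing idea (nothing Arrow-specific here; Arrow’s contribution is standardizing the memory layout so every runtime interprets the same bytes the same way): two views over one chunk of native memory, where a write through one view is visible through the other with no serialization step.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Zero-copy in miniature: producer and consumer share one region of native
// memory through separate views; no bytes are ever copied between them.
public class ZeroCopyView {
    public static double readBack() {
        ByteBuffer raw = ByteBuffer.allocateDirect(16).order(ByteOrder.nativeOrder());
        // duplicate() shares the same memory; only the view object is new
        ByteBuffer view = raw.duplicate().order(ByteOrder.nativeOrder());
        raw.putDouble(8, 42.0);    // "producer" writes through one view
        return view.getDouble(8);  // "consumer" reads through the other
    }
}
```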

Sweet. How did I miss this work…

Did not study yet but quick one: this is all in-process, off-heap memory between JVM and CPython, none of the PySpark/socket stuff - right?

If so then I did something like that with #Jep (mentioned above).

Will dig into this.

Working on what I call #DataFabric to unify some frameworks and formats, including “higher-order tensors”, for both runtime and at-rest (“polystore” above).

Collected top references here :

I know it is a big favor but would … pay for your review and guidance of how it fits with your work / view. Would you be kind to provide when you get a chance, @agibsonccc ?

I believe we need type level unification first so we can pursue data lineage across frameworks and languages (“beyond #datapipeline”).

hi Nick / @SemanticBeeng – it seems like this is branching off into a discussion of a particular project you are interested in building, maybe you want to start a new thread?

Snippets from the LinkedIn conversation with Wes.

(Nick) I am an architect doing ML engineering using polyglot, poly-framework and polystore tech stacks.

In doing so there are many challenges, many stemming from the heterogeneity, duplication and overall richness but also … mess of having all these frameworks available for reuse.
People coming into such a project pile together lots of things without much care for good design.

Because I am trying to unify (below more on how), I am firstly trying to understand the current functionality, but also the vision, of the existing frameworks on my list.

In doing so I must often ask … across the focus people working on the framework have when building it (user vs builder).

(Wes) I think it’s interesting but the presentation of the idea needs to not be a list of 25 tweet links. You could present your thesis in a way that is more intelligible to others. Reduce the self-referential tweets and focus on using clear and concise language to articulate the problem and what you believe to be possible solutions. I think the discussion is a bit impenetrable at the moment without taking hours to explore your mind map of tweets, if that makes sense.

Agreed, will do.

@SemanticBeeng Yeah, I agree with @wesm, let’s keep this thread to general JVM discussion :). I’m just documenting some of the things we’re seeing in the market here, in the hopes other people see some of the “pro Java” work going on in the space, since usually the majority out there is Python.