To follow up on @agibsonccc’s remark specifically with respect to JavaCPP: I coincidentally just visited the JavaCPP GitHub repo for the first time in a few years, and was happy to see how it’s grown since I last looked. Of particular note, JavaCPP now has bindings for Apache Arrow built around the native C++ implementation. This is significant because, while the Arrow project does have a “native Java” implementation, the Python bindings wrap the C++ side, so we can hope that PyArrow and JavaCPP’s Arrow match exactly.
In @SemanticBeeng’s words, the idea of off-heap JVM data as part of a polyglot, poly-framework platform that shares native memory in a near-seamless fashion is now nearly turnkey. I’m sorry to say I’ve not yet taken advantage of JavaCPP’s parser for presets to give back.
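To make the off-heap idea concrete, here is a minimal stdlib-only sketch (class name is my own): a direct ByteBuffer lives outside the GC heap, so the collector never moves it, and its native address is exactly the kind of thing Arrow-style frameworks hand across language boundaries.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OffHeapSketch {
    public static void main(String[] args) {
        // Allocate 1 KiB outside the JVM heap; the GC never relocates this
        // memory, so native code can safely hold a pointer to it.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024)
                                   .order(ByteOrder.LITTLE_ENDIAN); // Arrow's byte order

        // Store a double the way an Arrow Float64 vector lays it out.
        buf.putDouble(0, 3.14159);

        System.out.println(buf.isDirect());   // true: off-heap
        System.out.println(buf.getDouble(0)); // 3.14159
    }
}
```

The real sharing story (Arrow, JavaCPP) adds a schema and explicit lifetime management on top, but the underlying mechanism is the same: one contiguous native allocation, many language runtimes viewing it.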
Tools like Apache Arrow, JavaCPP (which also has up-to-date bindings for MKL 2020.0, TensorFlow, and CUDA 10.2, among others), Mono, LLVM, etc., along with increasingly sophisticated build tools, allow us to tailor our development around the characteristics of the data we’re working with and the functionality we need. I find myself approaching design in a more memory-centric way. Writing code around memory allows more freedom to choose an optimal method of parallelism, or e.g. optimization with instruction sets, or offloading to devices. The language (depending on the freedom I have to decide) is becoming almost secondary. I’m a Java developer first and foremost, and have been working with JNI since day one; it has never been easier to jump between languages (btw, I second the point that the JVM plays nicely with GPUs and other devices).
For example, when working with fairly static numerical data on which we need to compute, we have the LAPACK FORTRAN routines. Rather than re-invent the wheel (or work in FORTRAN), we have the Intel-maintained MKL, which wraps the LAPACK subroutines in a nice C API, shipped as a complete library (actually an entire set of optimized primitives and threading, plus icc, the Intel compiler) for us to link against. Assuming the same architecture and the same compilation, we have LAPACK available to us in our higher-level languages; the performance differences should mostly be VM overhead, memory allocation/de-allocation, JIT-ing, etc. That covers Python, and on the JVM (via JNI one function at a time, or now by importing e.g. JavaCPP.mkl._) it gives us Scala, Java, and Kotlin, as well as .NET C# and F# if you want, and the rest: Clojure, Matlab, R… However, we still need to manage our memory as efficiently as possible. Java/Scala allow for some fairly low-level memory handling, with facilities like sun.misc.Unsafe, and BufferedReaders/Writers that can be closed and flushed. JVM memory management is, IMO, a good balance between Python and C++. Even with pre-compiled native code, we want our memory to be as contiguous as possible.
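For the curious, here is what that low-level sun.misc.Unsafe route looks like in practice (a sketch; the class name is mine, and Unsafe is unsupported API that may warn or break on future JDKs):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeSketch {
    public static void main(String[] args) throws Exception {
        // sun.misc.Unsafe has no public constructor; grabbing the singleton
        // via reflection still works on current JDKs but is explicitly
        // unsupported and emits warnings.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // Manual, C-style allocation: contiguous, off-heap, and entirely
        // our responsibility to free.
        long addr = unsafe.allocateMemory(8);
        unsafe.putLong(addr, 42L);
        System.out.println(unsafe.getLong(addr)); // 42
        unsafe.freeMemory(addr);
    }
}
```

This is the same contiguous-native-memory discipline the pre-compiled BLAS/LAPACK code expects, just exposed inside the JVM.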
At the moment, the RapidsAI / Numba / Python3 / [CuPy | LLVM-jitted] / Arrow stack does seem to be doing quite well in the static numerical computation space. Numba allows access to the SIMD instruction sets via LLVM jitting, though I haven’t had a chance to take a look at the latest and greatest as of today, which looks to do away with some of the problems previously encountered with the JIT approach.
Again, I very much like the polyglot, memory-centric approach to development. Say our memory is streaming, dynamic, does not fit in core, and is of unknown dimension. Were it simply out-of-core, numerical, and compute-intensive, we’d probably want to use something like MPI with CUDA nodes, written in C.
But more likely, if we have streaming data of unknown dimension in a world where we’d be looking at the JVM, we’d be talking about an Apache stack: Storm/Flink (Java) stream processors, or Spark Streaming and Spark for the ETL, if not a full Lambda/Kappa architecture. For example, say we are vectorizing and labeling images and decide that our processing needs are best met by OpenCV. It’s easy enough to drop the OpenCV JavaCPP dependency into our pom, compile it for the chosen architecture, ship it out to the end nodes, vectorize the recognized section of the image, and on we go. If we feel that the actual reading of the license plate will be better performed by e.g. a PyTorch model, then we have a little extra work setting up the RapidsAI stack (essentially everything is available via conda) on each worker node.
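For reference, dropping OpenCV into the pom really is this small; the -platform artifact bundles prebuilt natives for the common architectures (version shown is one that existed at the time; pin whatever matches your JavaCPP presets):

```xml
<!-- org.bytedeco's -platform artifacts pull in prebuilt native libraries
     for all common OS/architecture combinations. -->
<dependency>
  <groupId>org.bytedeco</groupId>
  <artifactId>opencv-platform</artifactId>
  <version>4.5.1-1.5.5</version>
</dependency>
```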
However, maybe the problem becomes a bit more complicated and we have an object-recognition problem: we want to crop a license plate which OpenCV has recognized, read it, name it by license plate number, label it by state, and then vectorize the entire image. Say the images are coming off a fixed camera on a busy interstate, and now our milliseconds really count. Looking at it again from the data-centric POV, this time with compute wall-clock time a very high priority, we want to use the RapidsAI stack. Specifically, we want to pass off from a Spark
RDD[(image_idx: Long, image: [T])].map(...) closure to a mixed Java and Python pipeline, returning an
RDD[(plate_state: Array[Char](2), plate_number: String, image_idx: Long, vectorized_plate: Array[Int])]. Each vectorized image can be stored behind an Arrow
shared_ptr after plate detection and outlining via OpenCV (JVM). The image, vectorized into the
DLPack format or handed off directly to an ML platform via the
array_interface, goes to a Python pipeline using the RapidsAI stack already set up on the backend for state and plate-number recognition by e.g. cuML, MXNet, PyTorch, etc., in (multiple) GPU memory. This entire process is zero-copy from host to GPU memory.
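The zero-copy part deserves a tiny illustration (stdlib-only Java sketch, names my own): a second view over the same native allocation copies only metadata, never data, which is exactly the trick Arrow and the array-interface handoffs rely on.

```java
import java.nio.ByteBuffer;

public class ZeroCopySketch {
    public static void main(String[] args) {
        // One off-heap allocation...
        ByteBuffer owner = ByteBuffer.allocateDirect(16);

        // ...and a second handle onto the same bytes. duplicate() clones
        // only the position/limit bookkeeping, not the contents.
        ByteBuffer view = owner.duplicate();

        owner.putInt(0, 1234);
        System.out.println(view.getInt(0)); // 1234: same memory, no copy
    }
}
```

Scale that idea up to Arrow record batches and GPU device pointers and you have the pipeline described above.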
(Note: this method is not yet fully implemented; it is slated for Spark 3.0.0. This is mainly illustrative of a polyglot, multi-platform, language-agnostic approach to large-scale data analytics.)
In the end, the purpose of all of this is to give two plausible examples of platforms running data-centric applications on the JVM, their shortcomings and their strengths, along with the ease of workarounds (e.g. the JVM not using SIMD instructions): the first, an application using FORTRAN LAPACK subprograms packaged and maintained in MKL via JavaCPP; the second, Apache Spark providing zero-copy access.
I also have high hopes for GraalVM. The idea of jitting routines as presented in the paper, and not simply for SIMD access, seems very promising as well.
Other times I find that I can saturate all CPUs on as many machines as I’ve budgeted, time-wise, with simple
bash commands, and for surprisingly powerful tasks these are the most useful tools.
If the nature of the data is such that it will not fit in core memory, and the job is not worth a cluster and can be run overnight, e.g. the functionality is I/O bound and data needs to be pulled from
S3://, processed, and pushed back: please, give me Python.
In my opinion, neither Java nor the JVM is obsolete. It’s still a solid enterprise platform and has, over the years, fixed some of the security issues it was notorious for. I can’t speak to the future maintenance of the platform, which, like everything, is uncertain and left to the whims of a few. Scala is a great language which may have been a bit of a shooting star, burning out in its glory; its niche seems to have been the Spark ecosystem, and the learning curve is a bit high for overworked programmers. I don’t know. More importantly, though, we don’t live in a Java/C-variant/Perl/Lisp-variant world anymore. The prevalence of new, solid languages, DSLs, and libraries is something I embrace. I see from the above that I am thinking along the same lines as people for whom I have great respect: adopting this inundation of new tech, looking at the strengths and weaknesses of each, and designing around the memory, however large or small those requirements happen to be. Language interoperability has never been better.
(thank you @SemanticBeeng for the paper… good read!)