Java / JVM development trends for data projects

Without a great deal of hard evidence, I've been saying to people for a while that Java and JVM development in general, as it relates to data processing, seems to be in a state of general decline.

I know of some specific new JVM-related projects (e.g. Apache Iceberg (incubating), things related to Apache Spark, and many products, like Dremio) that are active, and presumably most Hadoop-ecosystem projects are being maintained.

So there are several questions where I’m interested in the opinions of others:

  • Is Scala growing, stable, or in decline? It seems like Rust is capturing a lot of the excitement nowadays.
  • What is the net impact of university programs teaching less software engineering in Java to undergrads? For example, at MIT the “legendary” software engineering course for CS majors was 6.170, and it used to be in Java. Now it’s in JavaScript.
  • Is “Java the new COBOL”?

It would be interesting to collect some more scientific data about what is going on with the JVM ecosystem and how that is going to affect all of us over the next 10 years or so. (To be clear, I'm not interested in FUD, just the facts about what is happening or is going to happen.)

2 Likes

I don’t have much insight on where the trends are heading but I for sure have some opinions on this topic! Spark appears to be the de facto tool for distributed compute at many companies because it is relatively easy to use and provides access to the mature Java ecosystem.

Spark has pros and cons for sure. Much of it is row-based rather than columnar and at this point it isn’t trivial to change that.

As you know, I’ve been championing the use of Rust and there is some momentum here but not as much as I had hoped. More recently I’ve come to realize that building a Spark-like platform in [insert favorite programming language] is fundamentally flawed and just ties you into [favorite language] and makes integration with other languages and platforms difficult.

My view has now shifted to believe that we need to be building these platforms with Arrow first and define query plans in protobuf so that we build from the ground up with interoperability in mind. I am in the process of pivoting my Ballista [0] project to this point of view.

I have also started building out an Arrow native query engine in Kotlin [1]. I’m excited about Kotlin because it has the functional programming richness of Scala but is 100% compatible with Java. This new query engine is architecturally ready to leverage Gandiva.

Porting Scala/Spark code to Kotlin/Ballista might not be too much of an ordeal, so it could provide a migration path. I am also working on interoperability with Spark via the Arrow Flight protocol so that users can mix and match workloads. Of course, this is just a personal project and I don't have much time to work on it, but hopefully it helps influence the way developers think about building these types of projects.
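
To make the Flight part concrete, here is a minimal sketch of what consuming an Arrow Flight stream can look like from Java. The endpoint address and ticket contents are hypothetical, and exact method signatures may differ slightly between Arrow releases, so treat it as illustrative rather than as Ballista's actual client code:

```java
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class FlightConsumer {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint exposed by a Spark/Ballista-style service.
        Location location = Location.forGrpcInsecure("localhost", 50051);
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FlightClient client = FlightClient.builder(allocator, location).build();
             // The ticket would normally come from a GetFlightInfo call; hard-coded here.
             FlightStream stream = client.getStream(new Ticket("example-query".getBytes()))) {
            while (stream.next()) {
                VectorSchemaRoot root = stream.getRoot(); // one Arrow record batch, off-heap
                System.out.println("Received batch with " + root.getRowCount() + " rows");
            }
        }
    }
}
```

The nice property is that the batches stay in Arrow format end to end, so the same stream could just as easily be consumed by a Rust, C++ or Python client.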

In summary, I think there is still a place for the JVM for query planning, cluster management, etc., but for the parts of a data platform where performance is critical, I think Rust is well placed to gain some traction. The downside of Rust is that it is difficult to learn and slow to develop in compared to garbage-collected languages like Kotlin, IMO. That could possibly change over time though.

Is Scala in decline? I do feel that many developers are drifting towards Kotlin where possible (but Spark forces a lot of people to stick with Scala). If there were a compelling JVM alternative to Spark then I think Scala would probably be in decline. There is also some drama lately in the Scala community.

Is Java the new COBOL? I often make that joke when looking at my resume (20 years experience with Java at this point) but the language is starting to modernize with a lot of influence from Scala and Kotlin so I think Java actually has a bright future. I don’t think we can discount it just yet.

[0] https://github.com/andygrove/ballista
[1] https://github.com/andygrove/kotlin-query

2 Likes

There are many dimensions to this matter, so I suggest the question be broken down.

#ProgrammingModel design point of view: Scala, Java and Kotlin applied to data products do indeed have a strong edge due to their #FunctionalProgramming nature (when used as such, of course).

This is of key importance because data products are polyglot and FP brings a lot of reusable discipline for managing complexity (see state management, composition), especially in distributed and polyglot applications.
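
As a tiny illustration of that discipline on the JVM (the record and rule names below are invented), pure functions over immutable values compose into a pipeline with no shared mutable state, which is precisely what makes such logic easy to test and distribute:

```java
import java.util.List;
import java.util.function.Function;

public class FpPipeline {
    // Immutable values (Java 16+ records); the domain is made up for illustration.
    record Event(String user, double amount) {}
    record Scored(String user, double amount, String tier) {}

    public static void main(String[] args) {
        // Pure, independently testable steps...
        Function<Event, Event> dropNegative =
                e -> e.amount() < 0 ? new Event(e.user(), 0.0) : e;
        Function<Event, Scored> score =
                e -> new Scored(e.user(), e.amount(), e.amount() > 100 ? "high" : "low");

        // ...composed into a pipeline with no shared mutable state.
        Function<Event, Scored> pipeline = dropNegative.andThen(score);

        List<Event> events = List.of(new Event("a", 42.0), new Event("b", 250.0), new Event("c", -5.0));
        events.stream().map(pipeline).forEach(System.out::println);
    }
}
```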

Side point, but relevant I believe: FP applied to Deep Learning to glue frameworks together https://twitter.com/semanticbeeng/status/1227886329213661185

Scala, Java and Kotlin as DSLs: There have been interesting projects that compile DSLs embedded in these languages to get higher abstraction or performance.

  1. How we made the JVM 40x faster
    https://twitter.com/semanticbeeng/status/1148330911949307904

  2. TornadoVM, a heterogeneous programming framework for Java programs, consisting of:
    https://2019.ecoop.org/details/aorta-2019-papers/4/Tornado-VM-A-Java-Virtual-Machine-for-Exploiting-High-Performance-Heterogeneous-Hard
      • a simple API for composing pipelines of existing Java methods,
      • an optimizing JIT compiler that extends the Graal compiler with hardware-aware optimizations that generate OpenCL C code, and
      • a runtime system that executes TornadoVM-specific bytecodes, performs memory management, and schedules the code for execution on GPUs, multicore CPUs, and FPGAs.

  3. Data-centric #metaprogramming in #Scala with implications to #ApacheSpark
    https://twitter.com/semanticbeeng/status/1227501147256016896

1 Like

Off-heap JVM data as part of polyglot, poly-framework, polystore #DataFabric

Evidently this is made possible by #ApacheArrow (needs no introduction)

  1. Used in #RapidsAI, it improves TCO massively and displaces frameworks like Airflow that force saving data to disk at composition boundaries: https://twitter.com/rapidsai/status/1192158282301026310

  2. As a shared runtime between deep learning frameworks (Python, C++, JVM):

      • Integrated with #ApacheSpark (JVM), it can be used to share large chunks of off-heap data in-process ("zero-copy") with Python code (through the Java Embedded Python framework, #jep; no PySpark and socket copying necessary). A small Java sketch follows at the end of this post.

See also #ApacheHudi and #DeltaLake as Java/Scala-based technologies that make file-based data lakes incremental: https://twitter.com/semanticbeeng/status/1196161901828493312
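
For readers who have not touched the Arrow Java API, a minimal sketch of what "off-heap and shareable" means in practice: the values live in native memory owned by an allocator, so the buffer (or its address) can in principle be handed to C++ or embedded Python code in the same process without copying. The jep/Gandiva wiring itself is omitted, and the column name and values are made up:

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;

public class OffHeapExample {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator();
             Float8Vector prices = new Float8Vector("prices", allocator)) {
            prices.allocateNew(3);
            prices.set(0, 1.5);
            prices.set(1, 2.5);
            prices.set(2, 3.5);
            prices.setValueCount(3);
            // The data buffer is off-heap; this address (plus the schema) is what a
            // native library or an embedded Python interpreter could consume zero-copy.
            long address = prices.getDataBuffer().memoryAddress();
            System.out.println("off-heap data buffer at 0x" + Long.toHexString(address));
        }
    }
}
```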

1 Like

I think a really interesting question in the Scala world is what Scala 3 will do to the community.

1 Like

Apache Spark and JVM-related projects will probably still be relevant for the next 10 years. That’s not going to go away. I took Parallel Programming from Doug Lea about 10 years ago and I think it’s all very much relevant today.

The largest pain point in my job of managing many data teams over the last few years has been people resources. Talent capable of dealing with concurrency on the JVM, and doing it well, is either costly to acquire (think hiring COBOL developers) or takes too long to train up. We have found that Golang provides an easy jumping-off point for most engineers starting their career to dive into the realm of concurrency and distributed systems. Because of that, we've built a large Data Platform in Go with only a few components left on Spark (and they are in the process of being rewritten in Go).

We still have many teams of Data Engineers using Databricks and Spark but I would argue that the majority of what they do does not require the JVM or Spark to do it. It’s simply a set of tools (DataFrames, reading/writing Parquet, etc.) that have been built out to the point that Data Engineers can quickly program ETL type jobs in notebooks.
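
For what it's worth, the kind of job I mean usually boils down to a few lines against the DataFrame API. A sketch using Spark's Java API, with made-up paths and column names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyRollup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("daily-rollup").getOrCreate();
        // Hypothetical input/output locations and columns.
        Dataset<Row> events = spark.read().parquet("s3://bucket/events/");
        events.filter("event_type = 'purchase'")
              .groupBy("country")
              .count()
              .write()
              .mode("overwrite")
              .parquet("s3://bucket/rollups/purchases_by_country/");
        spark.stop();
    }
}
```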

The ability to have simple code where anyone can be shuffled between projects is a huge win for anyone managing teams and necessary when convincing leadership to buy into a tech stack; especially when you start talking about millions of dollars a year in licensing costs. In Go, I can train up a junior engineer to work on code that can have a large impact in a short time. Getting that same junior engineer up to speed in a JVM language like Scala and having the confidence they are proficient enough to not cause concurrency issues has taken exponentially longer.

Lastly, while Rust seems to be a good direction for “computation” based things, I would argue the complexity of the language once again has landed us in the same bucket as the JVM languages did. When I think “data projects” I think concurrency and when I think concurrency I think Go. While there are some crates the community has centered around for concurrent things in Rust, I would still be hesitant to pitch Rust to any leadership once again due to the personnel cost to impact ratio. Side note: we were able to reduce the training time of our reinforcement learning (contextual multi-armed bandits over a large parameter space) from 15 minutes in Spark/Scala to 10 ms in Go so I’ve yet to encounter computation non-negotiables in Go.

1 Like

I haven’t been following Scala 3 plans. Do you have an opinion on what this will do to/for the community? Will this provide better compatibility with Java?

Eclipse Deeplearning4j and JavaCPP, the projects we maintain, have always been a bit of a weird player in the Java space. We maintain everything from our own allocators to our own package management for native binaries.

We're pretty big believers in Java the core language as a system for developing solid networked services, but we have generally taken a library approach to the ecosystem. This means not integrating with the full cross-platform approach the JDK team themselves want to take with Project Panama. We want things to "just work", and "just work" today. And not just work on "official OpenJDK", but also on OpenJ9 and Android.

We view Java not as "just the thing that runs Spark", but as a language and runtime with both positive and negative aspects for working in machine learning. For us, it's the following:

  1. A way of running on Android and iOS. We actually prioritize running things on "odd, non-standard" JVMs. Android has given the JVM ecosystem a number of neat libraries and has been an innovative corner of the space. We actually produce jars for iOS and Android as well as various other architectures. Here you'll see math code pre-optimized for flavors of AVX, iOS bindings, as well as 64-bit ARM:
    https://repo1.maven.org/maven2/org/nd4j/nd4j-native/1.0.0-beta6/

  2. Contrary to popular belief, it's also possible to make CUDA and co. "just work" in the JVM ecosystem. We're capable of doing that in DL4J and are helping the tensorflow/java community tackle these packaging problems as well. The best part? We don't even have to do JNI manually!

  3. Graalvm, despite its governance, actually has a lot of potential if done right. I’d like to see where the project goes and we are looking at it for some things ourselves.

  4. A language that needs workarounds for things. We have, I think, a controversial view on how Java and native code should work together, but we do pointers and C++-like programming in Java via our packaging in JavaCPP (a small sketch follows this list). Wrapped the right way, it allows for performance like you would see in Python and C, but packaged in a better runtime than Cython itself. It's still a verbose language, but it allows us to integrate well with existing applications if you understand the trade-offs.

  5. An alternative to Go: Go has a great niche for networked services. Java still has Netty and a great set of developers building solid networked stacks. Apple, Amazon, and co. are still building a lot of services in Java. Will it go away instantly? No. Will it eventually be replaced? Maybe over time. Even we ourselves are moving things more towards a multi-lingual future in DL4J itself.

  6. The home of Spark. Spark itself has always done ETL well and continues to evolve with the ecosystem (Kubernetes), but even Flink is starting to become a viable competitor over time, especially backed by Alibaba and Cloudera now.

  7. A big potential in IoT. I'm a big fan of what's going on with Eclipse IoT and Bosch: https://iot.eclipse.org/ I think Java on ARM might be an interesting frontier if done right; again, it's not ideal, and Go can largely run on these architectures as well. I don't ever see it dominating, but I do see it being a viable player.
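
Regarding point 4 (and the "no manual JNI" claim in point 2), here is a tiny, illustrative sketch of the pointer-style programming JavaCPP enables, i.e. off-heap allocation and access without hand-written JNI; the sizes and values are arbitrary:

```java
import org.bytedeco.javacpp.IntPointer;

public class PointerDemo {
    public static void main(String[] args) {
        // Allocates 1024 ints of native (off-heap) memory; no hand-written JNI involved.
        try (IntPointer data = new IntPointer(1024)) {
            for (int i = 0; i < 1024; i++) {
                data.put(i, i * i);
            }
            long sum = 0;
            for (int i = 0; i < 1024; i++) {
                sum += data.get(i);
            }
            System.out.println("sum = " + sum);
        } // native memory is deallocated here
    }
}
```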

Hopefully this makes sense! I hope some interesting ideas come from this exchange!

4 Likes

To follow up on @agibsonccc's remark, specifically with respect to JavaCPP: I coincidentally just visited the JavaCPP GitHub repo for the first time in a few years and was happy to see how it's grown since I last looked. Of particular note, JavaCPP now has bindings for Apache Arrow built around the native C++ implementation. This is significant because, while the Arrow project does have a "native Java" implementation, the Python bindings are wrappers for the C++ side, so we can hope that PyArrow and JavaCPP Arrow match exactly.

In @SemanticBeeng's words, the idea of "off-heap JVM data as part of a polyglot, poly-framework" platform, one which shares native memory in a near-seamless fashion, is now nearly turnkey. I hate to say that I've not yet taken advantage of JavaCPP's parser for presets to give back.
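
For anyone unfamiliar with that parser: a preset is essentially a small annotated configuration class that tells JavaCPP which headers to parse and which native library to link, and the Java API is then generated from it. A minimal, hypothetical example (the header, library, and package names are invented):

```java
import org.bytedeco.javacpp.annotation.Platform;
import org.bytedeco.javacpp.annotation.Properties;
import org.bytedeco.javacpp.tools.Info;
import org.bytedeco.javacpp.tools.InfoMap;
import org.bytedeco.javacpp.tools.InfoMapper;

// Hypothetical preset: parse mylib.h and link against libmylib,
// generating Java classes under org.example.mylib.
@Properties(
    target = "org.example.mylib",
    value = @Platform(include = "mylib.h", link = "mylib")
)
public class mylib implements InfoMapper {
    @Override
    public void map(InfoMap infoMap) {
        // Optional tweaks, e.g. skip a macro that should not be mapped.
        infoMap.put(new Info("MYLIB_INTERNAL_MACRO").skip());
    }
}
```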

Tools like Apache Arrow and JavaCPP (which also has up-to-date bindings for MKL 2020.0, TensorFlow, and CUDA 10.2, among others), Mono, LLVM, etc., along with increasingly sophisticated build tools, allow us to tailor our development around the characteristics of the data we're working with and the functionality needed. I find myself approaching design in a more memory-centric way. Writing code around memory allows more freedom to choose a more optimal method for parallelism, or e.g. optimization with instruction sets [1], or offloading to devices. The language (depending on the freedom I have to decide) is becoming almost secondary. I'm a Java developer first and foremost and have been working with JNI since day one; it has never been easier to jump between languages (by the way, I second the fact that the JVM plays nicely with GPUs and other devices).

For an example: if we are working with somewhat static numerical data on which we need to compute, we have the LAPACK FORTRAN routines, and rather than re-invent the wheel (or work in FORTRAN), we have the Intel-maintained MKL, which wraps the LAPACK subroutines in a nice C API with a complete library (actually an entire set of optimized primitives and threading, as well as icc, the Intel compiler) for us to link against. Assuming the same architecture and the same compilation, we then have LAPACK available to us in our higher-level languages; the performance differences should mostly be in VM overhead, memory allocation/de-allocation, JIT-ting, etc. That means Python, and on the JVM (via JNI one function at a time, or now by importing e.g. the JavaCPP MKL preset) Scala, Java and Kotlin, as well as .NET C# and F# if you want, and the rest: Clojure, Matlab, R… However, we still need to manage our memory as efficiently as possible. Java/Scala allow for some pretty low-level memory allocation with facilities like sun.misc.Unsafe, and BufferedReaders/Writers can be closed and flushed. JVM memory management is, IMO, a good balance between Python and C++. Even with the pre-compiled native code, we want our memory to be as contiguous as possible.
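
To make the MKL point concrete, here is roughly what calling a CBLAS routine through the JavaCPP MKL preset can look like from Java. The package and class names follow the JavaCPP presets convention (org.bytedeco.mkl.global.mkl_rt), but treat them as assumptions to check against the preset version you actually use:

```java
import org.bytedeco.javacpp.DoublePointer;
// Assumption: this is the MKL preset's generated "global" class; adjust to your version.
import static org.bytedeco.mkl.global.mkl_rt.cblas_ddot;

public class MklDot {
    public static void main(String[] args) {
        int n = 4;
        // Off-heap buffers handed straight to the native CBLAS routine.
        try (DoublePointer x = new DoublePointer(1.0, 2.0, 3.0, 4.0);
             DoublePointer y = new DoublePointer(10.0, 20.0, 30.0, 40.0)) {
            double dot = cblas_ddot(n, x, 1, y, 1); // 1*10 + 2*20 + 3*30 + 4*40 = 300
            System.out.println("dot = " + dot);
        }
    }
}
```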

At the moment, it does seem that the RapidsAI / Numba / Python3 / [CuPy | LLVM-jitted] / Arrow stack is doing quite well in the static numerical computation space. Numba allows access to the SIMD instruction sets via LLVM jitting [2], though I haven't had a chance to take a look at the latest and greatest, just out today [3], which looks to do away with some of the problems encountered in [1] via JIT.

Again, I very much like the polyglot, memory-centric approach to development. Say our memory is streaming, dynamic, does not fit in core, and is of unknown dimension. Were it simply out-of-core, numerical and compute-intensive, we'd probably want to use something like MPI with CUDA nodes, written in C.

But more likely, if we have streaming data of unknown dimension, in the world where we'd be looking at the JVM we'd be talking about an Apache stack: Storm/Flink (Java) stream processors, or Spark Streaming and Spark for the ETL, if not a full Lambda/Kappa architecture. For example, if we are vectorizing and labeling images and decide that our processing needs are best met by OpenCV, it's easy enough to drop the OpenCV JavaCPP dependency into our pom, compile it for the chosen architecture, ship it out to the end nodes, vectorize the recognized section of the image, and on we go. If we feel that the actual reading of the license plate will be better performed by e.g. a PyTorch model, then we have a little extra work setting up the RapidsAI stack (essentially everything is available via conda) on each worker node.

However, maybe the problem becomes a bit more complicated and we have an object recognition problem: maybe we want to crop a license plate which OpenCV has recognized, read it, name it by license plate number, label it by state, and then vectorize the entire image. Say that the images are coming off of a set of cameras on a busy interstate, and now our milliseconds really count. Again looking at it from the data-centric point of view, this time with compute wall-clock time a very high priority, we want to use the RapidsAI stack [5]: specifically, we want to pass off, from a Spark RDD[(image_idx: Long, image: [T])].map(...) closure, to a mixed Java and Python pipeline, returning an RDD[(plate_state: Array[Char](2), plate_number: String, image_idx: Long, vectorized_plate: Array[Int])]. Each vectorized image can be stored behind an Arrow shared_ptr after plate detection and outlining via OpenCV (JVM). The image, vectorized into the DLPack format [7] or handed off directly to an ML platform via the array interface, goes to a Python pipeline using the RapidsAI stack already set up on the backend for state and plate-number recognition by e.g. CuML [8], MxNet, PyTorch, etc. in (multiple) GPUs' memory. This entire process incurs zero copy overhead from host to GPU memory.

(Note: this method is currently not fully implemented; it is slated for Spark 3.0.0. This is mainly for illustrative purposes, to show a polyglot, multi-platform, language-agnostic approach to large-scale data analytics.)

In the end, the purpose of all of this is to give two plausible examples of platforms running data-centric applications on the JVM, their shortcomings and their strengths, along with the ease of workarounds (e.g. the JVM not using SIMD instructions): the first an application using Fortran LAPACK subprograms packaged in the Intel-maintained MKL via JavaCPP, the second Apache Spark providing zero-copy access to a GPU-backed Python pipeline.

I also have high hopes for GraalVM. As well, the idea of JIT-ting routines as presented in [1], though not simply for SIMD access, seems very promising.

Other times I find that I can saturate all CPUs on as many machines as I've estimated time-wise with simple bash commands, and for surprisingly powerful tasks these are the most useful tools.

If the nature of the data is such that it will not fit in core memory, and the job is not worth a cluster and can be run overnight, e.g. the functionality is I/O bound (data needs to be pulled from S3, processed, and pushed back), please give me Python and Boto3.

In my opinion, neither Java nor the JVM is obsolete. It's still a solid enterprise platform and has, over the years, fixed some of the security issues it was notorious for. I can't speak to the future maintenance of the platform, which, like everything, is uncertain and left to the whims of a few. Scala is a great language which may have been a bit of a shooting star, burning out in its glory; its niche seems to have been around the Spark ecosystem, and the learning curve a bit high for overworked programmers. I don't know. More importantly, though, we don't live in a Java/C-variant/Perl/Lisp-variant world anymore. The prevalence of new, solid languages, DSLs and libraries is something that I embrace. I see from above that I am thinking along the lines of people for whom I have great respect: adopting this inundation of new tech, looking at the strengths and weaknesses of each, and designing around the memory, however large or small those requirements happen to be. Language interoperability has never been better.

TL;DR: I have high hopes that Java/JVM-based languages survive; I don't see them as obsolete. It's unfortunate that JavaScript is being taught before C; C should be mandatory in a CS undergrad IMO, if only for the understanding of strong typing and memory allocation. I can't say what the net impact will be, but I'd imagine that learning JavaScript over Java (or, more importantly, C) is likely going to have some impact in terms of best practices, as there has only really been a unified specification for 5 years or so. It's an extremely powerful language, and one can do a lot with a little, but I don't know enough about how e.g. multi-threading works with JavaScript. Are there any kind of standalone JavaScript clusters? How does one handle distributed memory? I know that many libraries that I'm familiar with in C or Java have JavaScript bindings, but I've never seen anyone use them. IMO, the bottom line is that those who are not natural logicians or programmers will have trouble learning other languages if they learn JavaScript first in CS. I do see in some ways Java being the new COBOL, but at the same time C/C++ are the new COBOL as well as the new C/C++. There will always be legacy systems that need maintenance, and with interest declining in Java and Scala, it seems that there will be a lot of high-paying legacy work out there. However, my favorite interview question for Data Engineers is to explain briefly their knowledge of JavaCPP; if not that specific tool, another automated JNI parser. Some of the most successful Data Engineers work regularly in Java/Scala with JNI. Spark has plenty of room for custom tailoring.

–Andy

[1] https://astojanov.github.io/publications/preprint/004_cgo18-simd.pdf
(thank you @SemanticBeeng for the paper… good read!)
[2] http://numba.pydata.org/numba-doc/latest/user/performance-tips.html#intel-svml
[3] https://github.com/deepmind/rlax
[4] https://numba.pydata.org/
[5] https://rapids.ai/about.html
[6] https://jira.apache.org/jira/browse/SPARK-26413
[7] https://github.com/dmlc/dlpack
[8] https://github.com/rapidsai/cuml#-open-gpu-data-science

4 Likes

Little story from the academic side:

Some years ago I worked on a demonstration project with the UK Data Service where they wanted to provide (Py)Spark-based big data computing resources to academics. It did sort of work, but oh, the problems: JVM out-of-memory errors, XML config (oh god, the XML :sob:). More importantly though, as an academic, Spark (and JVM things in general) always felt like something from the business world, where the problems and workflow and solutions are all pretty well known and defined. In research you quickly end up wanting to do something a bit different; you always need some bespoke logic or want to use a niche library, and Spark made this unnecessarily complicated.

Today, I can run the same analysis on a decent workstation using classic Python data science tools.

So for me there are two things really working against JVM in data projects:

First, computer processing speeds are catching up to a lot of even pretty intense workloads. You just don’t need a Spark cluster anymore for a lot of tasks, and although many people want to be in the Big Data cool club the reality is their data is not actually that big or growing.

Secondly, if you do need parallel/cluster processing, the Python ecosystem is simpler and more flexible. You can get basically the same functionality but without all the ceremony of setting it up, and you can easily integrate the long tail of specialised libraries. I suspect people are getting used to Python in their studies/PhDs and entering the workplace expecting to be able to use the same tools, and that this will drive the evolution of tech stacks in companies.

1 Like

+1
But I really also meant "… polystore" in my initial post. Both data in motion and data at rest must be polyglot: no Python pickles, for example. And not all data is relational: we need tensor, graph, NLP "embeddings", etc.

See #tsalib, #tileDB and #StructType for ideas

Indeed.
This can be seen as a lack of vision from the Python guys about productionization.

But I am more worried about JVM people missing out on opportunities from big data analytics.
There is a need to bring awareness and influence.

I believe that showing how to use Scala/Java and Python in-process to code business logic / algorithms around shared data is key to making the case.
I've done that with #jep DirectNDArray.

Possibly - after all a lot of the Python scientific stack is just that - scientific. In research there is little to no value in productionization. That said people do run big Python based systems, though TBH I have no idea what the challenges are since I have really only seen the other end: how difficult it is as a single person/small team to set up JVM based systems designed for Enterprise with a whole dedicated IT department (and presumably a guy whose only job is writing XML :stuck_out_tongue: )

But I really also meant "… polystore" in my initial post. Both data in motion and data at rest must be polyglot: no Python pickles, for example. And not all data is relational: we need tensor, graph, NLP "embeddings", etc.

Aha, I thought I'd missed something there. Yes, it was late at night and I missed your meaning of polystore. Absolutely. Something like PMML, but not as terrible, and, as you say, also for data in motion.

Originally I'd thought that Arrow would be capturing something like this, by serializing to Parquet. But as I understand it, that is mainly for structured data.

I've not used DLPack; it's something that seemed to be going in that direction (https://github.com/dmlc/dlpack), and it seems like something useful…

1 Like

Circling back purely to the open source perspective, it would be interesting to try to estimate the number of full-time engineers working on data-related projects in each programming language. In other words, “follow the money”. Everyone has lots of opinions about what directions would be promising given infinite development resources, but I’m interested in the hard empirical data about who is spending money where.

This could be extracted by analyzing GitHub repository analytics if a suitably comprehensive list of projects could be assembled. The data is likely very noisy (for example, some projects have code in multiple programming languages), but you still might be able to get a ballpark estimate.

Some thoughts that jump to mind:

  • As one example, it would not surprise me to learn that Google Brain is spending $50 million / year or more on ML-related open source development for Python and Swift programmers (having people like Chris Lattner on staff does not come cheap) – this includes ecosystem-related projects like MLIR and JAX.
  • I don't know exactly how many full-time people NVIDIA has working on the RAPIDS ecosystem (which is all C++ / CUDA / Python) but I think it's pushing 50 or so.
  • Amazon, Facebook, and Microsoft all are similarly making significant investments in tools, libraries, and cloud infrastructure in support of data developers – most of this is oriented at either Python/R or Apache Spark users or some mix of the two.
  • The Ray (Python distributed computing framework) developers from UC Berkeley just raised $20M to found anyscale.io
  • There are not one but two Dask companies now (https://www.saturncloud.io/s/ and https://coiled.io/)
  • There are many other venture-funded data companies – public and stealth – focusing on non-JVM data technologies founded in the last 5 years

It could be that we’re experiencing some kind of peak in the hype cycle of Python/R-related machine learning tech, but I wouldn’t want to take on the other side of the bets that the tech juggernauts are making right now.

Again, quantitative data would be useful; maybe this is an effort we could crowdsource. It would also be useful to keep track of the magnitude of open source contributions broken down by the e-mail addresses found in the git logs.
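
As a crude starting point for crowdsourcing this, even counting commits per contributor e-mail domain gets at the "who is paying whom" question. A rough sketch, assuming git is on the PATH and the repository of interest is already cloned locally:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

public class CommitsByDomain {
    public static void main(String[] args) throws Exception {
        // Expects the path of a locally cloned repository as the only argument.
        Process git = new ProcessBuilder("git", "-C", args[0], "log", "--format=%ae").start();
        Map<String, Integer> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(git.getInputStream()))) {
            String email;
            while ((email = reader.readLine()) != null) {
                int at = email.indexOf('@');
                String domain = at >= 0 ? email.substring(at + 1) : "(unknown)";
                counts.merge(domain, 1, Integer::sum);
            }
        }
        git.waitFor();
        counts.forEach((domain, n) -> System.out.println(domain + "\t" + n));
    }
}
```

Run over a list of data projects, the domain counts would be noisy (personal addresses, contractors, multi-language repos) but probably good enough for a ballpark "follow the money" estimate.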

It's not growing as rapidly as a few years ago, but it's not in decline. The Scala community right now is a tripartite mix of 1) the Spark community, 2) the Akka / Lightbend / microservices community, and 3) the category theory / type systems / pure FP community.

There is some hype that Kotlin will take over since it’s backed by Google and Scala is community driven. Regardless whether there’s truth to the hype I think it’s a sad indictment of the state of open source today that community guardianship is seen as a competitive disadvantage.

2 Likes

@wesm There are many markets when it comes to software development and each market has its own distribution of programming languages. Java (although a platform, not just a programming language) happens to reside mostly in the enterprise market, while C, and to a certain extent C++, is often the only option available for applications in the embedded/edge/low-level market; scientists mainly use Python these days; hobbyists experiment a lot with Rust, Swift, etc. Anyway, the point is, to talk about Java we need to focus on the enterprise market, where it matters, as @agibsonccc, @apalumbo and @SemanticBeeng point out. In my opinion, what's happening there is not Java or its ecosystem declining; what's happening is that the focus on the JCP and OpenJDK is diminishing. Since their processes have been so central to Java since the beginning, many people incorrectly attribute this to a decline of the ecosystem in general. A few data points with regard to this:

  • GraalVM was spun off from OpenJDK and, although still part of Oracle, attracts more attention than OpenJDK itself these days
  • Neither Kotlin nor Scala, nor anything related such as Spark, appears anywhere as a JSR, and none of them are related to OpenJDK in any way
  • There are now multiple forks of the JDK (Android, Corretto, Dragonwell, Liberica, OpenJ9, Zulu, etc.), although they are still mostly compatible with each other
  • Java EE was recently moved away from the JCP to the Eclipse Foundation under the name of Jakarta EE

The JCP and OpenJDK have many failings, and although they are generally doing a good job at developing the JDK itself, they are not so good at developing the tools and libraries surrounding it. For example, Maven has nothing to do with them. It happened despite them because the enterprise market needed that kind of tool, and nothing like it exists anywhere else. However, Java hasn’t been so lucky with other things. It has become abundantly clear that code on the JVM is never going to be as fast as properly optimized C/C++ code, and that we’re never going to code GPUs and other accelerators with Java (or Python for that matter). So we need tools to deal with native code and libraries. Python has tools like conda and pip that can package native libraries just fine, and other tools like cython and pybind11 exist that let us use C/C++ libraries easily enough. But those are sorely missing for Java, and that’s what I’ve been trying to compensate for with JavaCPP:
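
To illustrate what "packaging that just works" can look like on the Java side: the JavaCPP -platform artifacts bundle prebuilt native libraries for each OS/CPU inside jars, and the Loader extracts and loads the right one at runtime. A minimal sketch, assuming e.g. org.bytedeco:openblas-platform is on the classpath (the exact class name follows the presets convention and may vary by version):

```java
import org.bytedeco.javacpp.Loader;

public class NativeLoadCheck {
    public static void main(String[] args) {
        // Extracts the bundled native OpenBLAS build for this OS/CPU from the jar,
        // loads it, and returns the path it was extracted to.
        String path = Loader.load(org.bytedeco.openblas.global.openblas.class);
        System.out.println("Loaded native OpenBLAS from: " + path);
    }
}
```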

OpenJDK initially tried to do something about this, but they have officially given up on coming up with user-friendly high-level APIs:

The community had been pinning their hopes on Project Panama, but now that it’s clear it won’t be relevant (other than to provide a low-level alternative to JNI), things are going to become interesting. I’m expecting something from GraalVM, unless they decide to expand their use of JavaCPP:

So, where does that leave us today? Recent popular projects such as Arrow and cuDF, just to take these two as examples of "data projects", still go right ahead and decide to write JNI bindings manually, hacking together one-off packaging and loading mechanisms, giving us libraries that are lacking both in features and in performance. Of course Java developers are not happy about using such poorly written and packaged code! They're used to JDK-level quality. Neither can we expect enterprises to put software into production that was compiled using random versions of C++ compilers, that behaves inconsistently across libraries, especially at load time, and that can't interoperate with any kind of efficiency! Fortunately, the tide is turning; the community is realizing that we're missing something here. Recently, I've personally replaced all the manually written JNI code in TensorFlow's codebase, halving the lines of code required and increasing the performance to boot, all in just a few days of work:

If Arrow and cuDF remain relevant over the next few years, something similar will happen for those as well, maybe with JavaCPP, maybe with something better. For the moment, since we actually need to access functionality that isn't available in the official bindings, and for performance and usability reasons as well, we've started to refactor our codebase with JavaCPP to use their C++ APIs from Java instead:

In my opinion, people are getting the message, and we will continue to see increased activity around providing the community with more tools like that. I think this is what we should be looking at. When it feels like Java is dead, take a look at the number of projects that have moved away from writing JNI manually and things like that, and consider the number of distributions containing native libraries, such as the JavaCPP Presets, that are going to appear over the next few years. I think what happens in that space could become a good health indicator of sorts for the Java community.

1 Like

FWIW, the dl4j team just raised 800 million USD ourselves:

As for our customers/users, typically there isn't a lot of ceremony around deployment, but you have big, non-Oracle companies doubling down on the JVM as their main way of writing infrastructure:


IBM buying Red Hat also tells me Java isn't done yet.

Money-wise, I think there is a lot of value in making the JVM ecosystem a strong competitor, at least for enterprises. It allows deployment into a lot of existing tech.

For every anecdote I hear about Python being flexible, I can counter with one where a big company buys a startup that built something cool in Python, tries to deploy it into a container, and finds it leaks memory. That memory leak isn't going to be fixed either, because it's based on an old fork of something Google produced and moved on from. The Python community in research tends to be vastly ahead of the one in industry, with maintenance not being the largest priority.

A wide swath of our users adopt DL4J for our Keras import. You can deploy it as a jar file (just like you would a Go binary) and you don't need to worry about pinning the system-level versions of glibc and TF in order for it to work reliably.

The JVM as a culture at least tends to have maintenance and longevity as a priority, and a lot of Python tends to get rewritten/reimplemented in Java. There are just trade-offs to each culture and technology.

2 Likes

@wesm

As one example, it would not surprise me to learn that Google Brain is spending $50 million / year or more on ML-related open source development for Python and Swift programmers (having people like Chris Lattner on staff does not come cheap) – this includes ecosystem-related projects like MLIR and JAX.

Just of note here: Chris has left Google. https://www.sifive.com/blog/with-sifive-we-can-change-the-world