The GPU acceleration angle is interesting, especially for workloads that fall between OLAP and ML. Curious how it handles scaling across mixed CPU/GPU infra though.
andygrove 7 hours ago [-]
Congrats on the launch!
I contributed to the NVIDIA Spark RAPIDS project for ~4 years and for the past year have been contributing to DataFusion Comet, so I have some experience in Spark acceleration and I have some questions!
1. Given the momentum behind the existing OSS Spark accelerators (Spark RAPIDS, Gluten + Velox, DataFusion Comet), have you considered collaborating with and/or extending these projects? All of them are multi-year efforts with dedicated teams. Both Spark RAPIDS and Gluten + Velox are leveraging GPUs already.
2. You mentioned that "We're fully compatible with Spark SQL (and Spark)." and that is very impressive if true. None of the existing accelerators claim this. Spark compatibility is notoriously difficult with Spark accelerators built with non-JVM languages and alternate hardware architectures. You have to deal with different floating-point implementations and regex engines, for example.
Also, Spark has some pretty quirky behavior. Do you match Spark when casting the string "T2" to a timestamp, for example? Spark compatibility has been pretty much the bulk of the work in my experience so far.
Providing acceleration at the same time as guaranteeing the same behavior as Spark is difficult and the existing accelerators provide many configuration options to allow users to choose between performance and compatibility. I'm curious to hear your take on this topic and where your focus is on performance vs compatibility.
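To make the quirk concrete, here is a minimal sketch in plain Scala/Spark (no accelerator involved) that just prints whatever vanilla Spark returns for that cast; the point is that an accelerated engine has to reproduce the result exactly, whatever it turns out to be:

```scala
import org.apache.spark.sql.SparkSession

object CastQuirkCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cast-quirk-check")
      .master("local[*]")
      .getOrCreate()

    // Spark's string-to-timestamp cast accepts some surprising partial forms.
    // Whatever this prints is the behavior an accelerator has to reproduce.
    spark.sql("SELECT CAST('T2' AS TIMESTAMP) AS ts").show(truncate = false)

    spark.stop()
  }
}
```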
winwang 6 hours ago [-]
1. Yes! Would love to contribute back to these projects, since I am already using RAPIDS under the hood. My general goal is to bring GPU acceleration to more workloads. Though, as a solo founder, I am finding it difficult to have any time for this at the moment, haha.
2. Hmm, maybe I should mention that we're not "accelerating all operations" -- merely compatible. Spark-RAPIDS has the goal of being byte-for-byte compatible unless incompatible ops are specifically allowed. But... you might be right about that kind of quirk. Would not be surprising, and reminds me of checking behavior between compilers.
I'd say the default should be a focus on compatibility, and work through any extra perf stuff with our customers. Maybe a good quick way to contribute back to open source is to first upstream some tests?
Thanks for your great questions :)
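As a rough illustration of that "compatibility first, opt in to performance" dial with the Spark-RAPIDS plugin: the config keys below are recalled from the plugin's documentation and may have changed, so treat this as a sketch rather than ParaQuery's actual setup (it also assumes the rapids-4-spark jar is on the classpath and a GPU is visible).

```scala
import org.apache.spark.sql.SparkSession

object RapidsCompatFirst {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rapids-compat-first")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // load the RAPIDS accelerator
      .config("spark.rapids.sql.enabled", "true")
      // Compatibility-first default: keep known-incompatible ops on the CPU.
      .config("spark.rapids.sql.incompatibleOps.enabled", "false")
      // Per-workload opt-ins trade exact Spark behavior for speed, e.g.:
      // .config("spark.rapids.sql.castStringToTimestamp.enabled", "true")
      .getOrCreate()

    spark.range(10).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```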
It reminds me of the Hadoop days, where the data would be stored in HDFS and you would use MapReduce to process it. However, the concept there was to send the computation to the location of the data.
This really makes sense. I might be a little out of touch. I wonder, do you incur transfer costs when your data is in buckets and you process it by bringing the data to the compute?
winwang 2 hours ago [-]
If you stand up your compute cluster in the same region as your bucket, there are no egress fees. Otherwise, yes, in general. There are some clouds that don't have egress fees though, i.e. Cloudflare R2.
No relationship... yet! Hoping to have a good relationship in the future so I have a business reason to fly to Japan :D
Btw, interesting thing they said here (http://heterodb.github.io/pg-strom/): "By utilization of GPU (Graphic Processor Unit) device which has thousands cores per chip"
It's more like "hundreds", since the number of "real" cores is like (CUDA cores / 32). Though I think we're about to see 1k cores (SMSPs).
That being said, I do believe CUDA cores have more interesting capabilities than a typical vector lane, i.e. for memory operations (thank the compiler). Would love to be corrected!
threeseed 13 hours ago [-]
Many of us have been using GPU accelerated Spark for years:
https://developer.nvidia.com/rapids/
https://github.com/NVIDIA/spark-rapids
And it's supported on AWS: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spar...
Indeed, Spark-RAPIDS has been around for a while! And it's quite simple to have a setup that works. Most of the issues come after the initial PoC, especially for teams not wanting to manage infra, not to mention GPU infra.
random17 17 hours ago [-]
Congrats on the launch!
I'm curious about what kinds of workloads see a significant impact from GPU-accelerated compute, and what kinds still pose challenges. You mentioned that I/O is not the bottleneck; is that still true for queries that require large-scale shuffles?
winwang 16 hours ago [-]
Large scale shuffles: Absolutely. One of the larger queries we ran saw a 450TB shuffle -- this may require more than just deploying the spark-rapids plugin, however (depends on the query itself and specific VMs used). Shuffling was the majority of the time and saw 100% (...99%?) GPU utilization. I presume this is partially due to compressing shuffle partitions. Network/disk I/O is definitely not the bottleneck here.
It's difficult to say what "workloads" are significant, and easier to talk about what doesn't really work AFAIK. Large-scale shuffles might see 4x efficiency, assuming you can somehow offload the hash shuffle memory, have scalable fast storage, etc... which we do. Note this is even on GCP, where there isn't any "great" networking infra available.
Things that don't get accelerated include multi-column UDFs and some incompatible operations. These aren't physical/logical limitations, it's just where the software is right now: https://github.com/NVIDIA/spark-rapids/issues
Multi-column UDF support would likely require some compiler-esque work in Scala (which I happen to have experience in).
A few things I expect to be "very" good: joins, string aggregations (empirically), sorting (clustering). Operations which stress memory bandwidth will likely be "surprisingly" good (surprising to most people).
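For a sense of the query shape being described (large join, string-keyed aggregation, global sort, i.e. mostly shuffle and memory bandwidth), here is a hedged sketch with made-up tables and columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ShuffleHeavyShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-heavy-shape").master("local[*]").getOrCreate()

    // Stand-ins for a large fact table and a smaller dimension table.
    val events = spark.range(0, 10000000).select(
      (col("id") % 100000).as("user_id"),
      concat(lit("sku_"), (col("id") % 5000).cast("string")).as("sku"),
      (rand() * 100).as("amount"))
    val users = spark.range(0, 100000).select(
      col("id").as("user_id"),
      concat(lit("region_"), (col("id") % 50).cast("string")).as("region"))

    events.join(users, "user_id")                      // shuffle: hash join
      .groupBy("region", "sku")                        // shuffle: aggregation on string keys
      .agg(sum("amount").as("revenue"), countDistinct("user_id").as("buyers"))
      .orderBy(desc("revenue"))                        // shuffle: global sort
      .show(20, truncate = false)

    spark.stop()
  }
}
```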
Otherwise, Nvidia has published a bunch of really-good-looking public data, as have some other public companies.
Outside of Spark, I think many people underestimate how "low-latency" GPUs can be. 100 microseconds and above is highly likely to be a good fit for GPU acceleration in general, though that could be as low as 10 microseconds (today).
_zoltan_ 15 hours ago [-]
8TB/s bandwidth on the B200 helps :-) (yes, yes, that is at the high end, but 4.8TB/s @ H200, 4TB/s @ H100, 2TB/s @ A100 is nothing to sneeze at either).
winwang 14 hours ago [-]
Very true. Can't get those numbers even if you get an entire single-tenant CPU VM. Minor note, A100 40G is 1.5TB/s (and much easier to obtain).
That being said, ParaQuery mainly uses T4 and L4 GPUs with "just" ~300 GB/s bandwidth. I believe (correct me if I'm wrong) that should be around a 64-core VM, though obviously dependent on the actual VM family.
dogman123 15 hours ago [-]
This could be incredibly useful for me. Currently struggling to complete jobs with massive amounts of shuffle with Spark on EMR (large joins yielding 150+ billion rows). We use Glue currently, but it has become cost prohibitive.
Then mount FSx for Lustre on all of your EMR nodes and have Spark write shuffle data there. It will massively improve performance and the shuffle issues will disappear.
It is expensive, though. But you can offset the cost because you can now run entirely Spot instances for your workers: if you lose a node, there's no recomputation of the shuffle data.
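A hedged sketch of that idea: point Spark's local (shuffle/spill) directories at the shared mount. The path is hypothetical, and on EMR this is normally set through spark-defaults / YARN local-dir settings rather than in application code:

```scala
import org.apache.spark.sql.SparkSession

object LustreShuffleDirs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lustre-shuffle-dirs")
      // spark.local.dir controls where shuffle and spill files land; pointing it at a
      // shared FSx-for-Lustre mount is the idea described above. Hypothetical path.
      .config("spark.local.dir", "/mnt/fsx/spark-shuffle")
      .getOrCreate()

    // ... run the shuffle-heavy job as usual ...
    spark.stop()
  }
}
```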
winwang 14 hours ago [-]
Is the shuffle the biggest issue? Not too sure about joins but one of the datasets we're currently dealing with has a couple trillion rows. Would love to chat about this!
mritchie712 14 hours ago [-]
> they're saving over 60% off of their BigQuery bill
how big is their data?
A lot of BigQuery users would be surprised to find they don't need BigQuery.
This[0] post (written by a founding engineer of BigQuery) has a bit of hyperbole, but this part is in line with my experience:
> A couple of years ago I did an analysis of BigQuery queries, looking at customers spending more than $1000 / year. 90% of queries processed less than 100 MB of data. I sliced this a number of different ways to make sure it wasn’t just a couple of customers who ran a ton of queries skewing the results. I also cut out metadata-only queries, which are a small subset of queries in BigQuery that don’t need to read any data at all. You have to go pretty high on the percentile range until you get into the gigabytes, and there are very few queries that run in the terabyte range.
We're[1] built on duckdb and I couldn't be happier about it. Insanely easy to get started with, runs locally and client-side in WASM, great language features.
0 - https://motherduck.com/blog/big-data-is-dead/
1 - https://www.definite.app/
They have >1PB of data to ETL, with some queries hitting 450TB of pure shuffle.
It's very true that most users don't need something like BigQuery or Snowflake. That's why some startups have come up to save Snowflake cost by "simply" putting a postgres instance in front of it!
In fact, I just advised someone recently to simply use Postgres instead of BigQuery since they had <1TB and their queries weren't super intensive.
threeseed 13 hours ago [-]
> A lot of BigQuery users would be surprised to find they don't need BigQuery.
No they wouldn't.
a) BigQuery is the only managed, supported solution on GCP for SQL based analytical workloads. And they are using it because they started with GCP and then chose BigQuery.
b) I have supported hundreds of Data Scientists over the years using Spark and it is nothing like BigQuery. You need to have much more awareness of how it all fits together because it is sitting on a JVM that when exposed to memory pressure will do a full GC and kill the executor. When this happens at best your workload gets significantly slower and at worst your job fails.
winwang 12 hours ago [-]
Hopefully, we can be another managed solution for those on GCP.
And as for your second point, yep, Spark tuning is definitely annoying! BigQuery is a lot more than just the engine, and building a simple interface for a complicated, high-performance process is hard. That's a big reason why I made ParaQuery.
threeseed 12 hours ago [-]
You may want to look into DataMechanics, another YC startup that tried something similar. They were acqui-hired by NetApp.
If I remember correctly, they focused on the SME space because in enterprise you will likely struggle against pre-allocated cloud spend budgets which lock companies into just using GCP services. I've worked at a dozen enterprise companies now and every one had this.
winwang 12 hours ago [-]
Enterprises can deploy on their own GCP, and we're planning on releasing on GCP Marketplace.
For a similar cost, what if their pipeline were 5x faster, and they didn't have to deal with managing the deployment themselves?
Thanks for telling me about DataMechanics!
threeseed 10 hours ago [-]
a) Enterprises have almost entirely moved away from self-hosting software. GCP Marketplace is fine but I would probably also look at a Kubernetes option as many companies have GKE clusters.
b) It won't be 5x faster though, and I strongly recommend you don't take a marketing attitude when selling this type of software, because it will be mostly technical engineers and architects deciding on this and we aren't stupid. I have run GPU accelerated Spark clusters for years for enterprise companies, and you will be able to accelerate the query part of the pipeline, but that's like 20% of what a typical job does.
winwang 10 hours ago [-]
a) Since being fully-managed is one of my value props, that's probably better for us.
b) Of course I'm only accelerating the Spark/query part. Not sure what you mean. And in that case, I took a query which was 44 minutes on BigQuery and ran it with a "comparable" cluster on ParaQuery in 5.5 minutes. Perf is slightly variable, so maybe it's 40 minutes vs 6 minutes. In that case, ParaQuery would still be 6.5x faster, and >2x cheaper. That being said, it was just a benchmark ETL query with some random data (50b rows), and these things do vary between workloads.
So yeah, without knowing more about the use case you're talking about, it's hard to say. Even Nvidia has a hard time optimizing certain TPC-DS queries, btw, so it's not like I can just 5x anything!
mritchie712 9 hours ago [-]
> No they wouldn't.
haha, you're giving people way too much credit. Tons of people make bad software purchasing decisions. It's hard, people make mistakes.
Boxxed 14 hours ago [-]
I'm surprised the GPU is a win when the data is coming from GCS. The CPU still has to touch all the data, right? Or do you have some mechanism to keep warm data live in the GPUs?
winwang 14 hours ago [-]
Yep, CPU has to transfer data because no RDMA setup on GCP lol. But that's like 16-32 GB/s of transfer per GPU (assuming T4/L4 nodes), which is much more than network bandwidth. And we're not even network bound, even if there's no warm data (i.e. for our ETL workloads). However, there is some stuff kept on GPU during actual execution for each Spark task even if they aren't running on the GPU at the moment, which makes handling memory and partition sizes... "fun", haha.
threeseed 12 hours ago [-]
I've used GPU based Spark SQL for many years now and it sounds flashy but it's not going to make a meaningful difference for most use cases.
As you say, the issue is that you have an overall process to optimise: getting the data off slow GCS onto the nodes, shuffling it (which often writes it to slow disk) before the real processing even starts, then writing back to slow GCS.
winwang 11 hours ago [-]
Not sure what your use cases are, but I haven't had too much issue seeing good gains vs bare Spark -- GCS has not been my bottleneck.
_zoltan_ 3 hours ago [-]
would you be able to share a runtime with operator breakdown for the curious ones among us?
winwang 2 hours ago [-]
That's a pretty interesting idea, might take a bit to prepare a useful graphic/post.
Also, what do you think would be the best way to structure such a post?
But here's a small bit of something perf-y: during large shuffles, I was able to increase overall job performance/efficiency by using external shuffle, even with ~5s median shuffle-write times for couple-hundred-MB partitions (I hope I'm remembering this correctly, lol). That's not particularly great on its own, but it did allow for cost-efficiently chewing through some rather large datasets without dealing with memory issues. There's also an awesome side benefit in that it allows us to use cheap spot workers in more scenarios.
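For reference, the vanilla-Spark knobs behind that external-shuffle point look roughly like this; the exact deployment (node-level shuffle service vs. a fully disaggregated remote shuffle, which is what actually survives losing a spot node) differs per platform, so this is only a sketch:

```scala
import org.apache.spark.sql.SparkSession

object ExternalShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-shuffle-sketch")
      // Serve shuffle blocks outside the executor JVM so executors can come and go.
      .config("spark.shuffle.service.enabled", "true")
      // Dynamic allocation pairs naturally with cheap spot/preemptible workers.
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()

    // ... shuffle-heavy job ...
    spark.stop()
  }
}
```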
achennupati 13 hours ago [-]
this is super cool stuff, and would've been really interesting to apply to the Spark work I had to do for Amazon's search system! How is this different from something like using spark-rapids on AWS EMR with GPU-enabled EC2 instances? Are you building on top of spark-rapids, or is this a more custom solution?
winwang 13 hours ago [-]
Good question -- it depends. For certain workloads, it might look exactly the same! For others, I found that the memory and VM constraints were creating large inefficiencies. Also, many teams simply don't want to manage that level of data infra: managing EMR, instance type optimization, spark optimization (now with GPU configs!), custom images, upgrades, etc.
We take care of that and make it as easy as pie... or so we hope! On top of that, we also deploy an external shuffle service, and deal with other plugins, connectors, etc.
I suppose it's similar to using Databricks Serverless SQL!
Another thing: we ran into an incompatible (i.e. non-accelerated) operation in one of our first real workloads, so we worked with our customer to speed up that workload even more with a small query optimization.
rockostrich 13 hours ago [-]
> I started working on actually deploying a cloud-based data platform powered by GPUs (i.e. Spark-RAPIDS)
Based on this, the platform is using Spark-RAPIDS.
_zoltan_ 15 hours ago [-]
A couple of weeks ago at VeloxCon, one of the days was dedicated to GPU processing (the other being AI/ML data preprocessing); the cuDF team talked about their Velox integration as well. For those interested, it might be worth checking out.
disclaimer: my team is working on this very problem as well, and I was a speaker at VeloxCon.
winwang 15 hours ago [-]
Just checked out Velox. It's awesome that you're reducing duplicate eng effort! What was your talk about?
_zoltan_ 14 hours ago [-]
I was part of the panel discussion at the end of the 2nd day discussing hardware acceleration for query processing. before the panel there were very interesting talks about various approaches on how to get to the end goal, which is to efficiently use hardware (non-x86 CPUs, so mostly either GPUs or FPGAs/custom chips) to speed up queries.
winwang 14 hours ago [-]
Since you mentioned non-x86, how are things on the ARM side? I believe I heard AWS's Graviton + Corretto combo was a huge increase in JVM efficiency.
FPGAs... I somehow highly doubt their efficiency in terms of being the "core" (heh) processor. However, "compute storage" with FPGAs right next to the flash is really interesting.
billy1kaplan 14 hours ago [-]
Congrats on the launch, Win! I remember seeing your prototype a while ago and its awesome to see your progress!
latchkey 13 hours ago [-]
Great, let me know if you want a 1x AMD MI300x VM to build/test on. Free.
https://x.com/HotAisle/status/1921983426972025023
Nice! I attended a hackathon by Modular last weekend where we got to play with MI300X (sponsored by AMD and Crusoe). My team made a GPU-"accelerated" BM25 in Mojo, but mostly kind of failed at it, haha.
The software stack for AMD is still a bit too nascent for ParaQuery's Spark engine, but certain realtime/online workloads can definitely be programmed pretty fast. They also happen to benefit greatly from the staggering levels of HBM on AMD chips. Hopefully I can take a mini-vacation later in the summer to hack on your GPUs :)
latchkey 13 hours ago [-]
Great feedback and thanks for the follow on twitter.
Agreed, their software stack needs work, but thankfully that's why they are sponsoring developer events like you attended. The progress is happening fast and this is something that wasn't happening at all 6-12 months ago. It is a real shift in focus.
If you have specific areas you'd like me to pass up the chain for them in order for you to build support for your engine, please let me know and I'm happy to try to help however I can. It took a while for us to get there, but they are now extremely responsive to us.
winwang 12 hours ago [-]
All the Spark GPU acceleration right now is done via the Spark-RAPIDS plugin, so HIP would somehow have to support that. Since cuDF is the core part and hipDF is a thing, it might be doable in the near future.
latchkey 11 hours ago [-]
Oh, interesting! Thanks for the additional context.
This is one area where it is clear that Nvidia is a leader by not only providing the underlying kernels, but also the overall product framework integrations. One would have to port that entire plugin project over, which would probably be a ton of work to maintain.
For what it is worth, AMD just recently released two blog posts on hipDF, so at least they are putting that effort in.
https://rocm.blogs.amd.com/artificial-intelligence/cupy_hipd...
https://rocm.blogs.amd.com/artificial-intelligence/hipDF_pan...
Thanks for the links! I'm planning on contributing kernels back to open source, so will think of a way to be vendor agnostic. As far as I understand, HIP should make that doable.
latchkey 11 hours ago [-]
hipify attempts to do that, but it potentially requires you to maintain two source trees, which isn't optimal. In this case, you'd want to run a CI/CD to convert your CUDA code at build time and compile that. But I think there are edge cases where that isn't possible.
As you learned at the event, Modular is trying to make it more transparent by abstracting to a whole new language (Mojo).
Another solution coming down the line which doesn't require changes to your CUDA code, nor learning a new language is: https://docs.scale-lang.com/
winwang 9 hours ago [-]
Thanks for the link! Not the first time I came across it, but it's a good one.
If I had to bet on the longer term, I think that something like Mojo will win out -- a programming language (mostly) agnostic to the underlying vector processor hardware. Similar to how Rust can target various SIMD implementations, though I've only dabbled in that.
This reminds me of https://www.heavy.ai/ (previously MapD back in 2015/16?). How would you contrast it against HeavyDB? https://github.com/heavyai/heavydb
I'm not too familiar with HeavyDB, but here are the main differences:
- We're fully compatible with Spark SQL (and Spark). Meaning little to no migration overhead.
- Our focus is on distributed compute first.
- That means ParaQuery isn't a database, just an MPP engine (for now). Also means no data ingestion/migration needed.
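To illustrate the "no ingestion" point: a Spark SQL engine can query Parquet sitting in an object-store bucket in place. The bucket, path, and columns below are hypothetical, and a real deployment needs the GCS (or s3a) connector on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object QueryInPlace {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("query-in-place").getOrCreate()

    // Register Parquet files sitting in a bucket as a table; no copy or ingestion step.
    spark.read.parquet("gs://example-bucket/warehouse/orders/")   // hypothetical bucket/path
      .createOrReplaceTempView("orders")

    spark.sql("""
      SELECT customer_id, SUM(total) AS lifetime_value
      FROM orders
      GROUP BY customer_id
      ORDER BY lifetime_value DESC
      LIMIT 100
    """).show()

    spark.stop()
  }
}
```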
modelorona 17 hours ago [-]
Awesome to see GPUs being used for something other than crypto (are they still?) and AI.
How is it priced? I couldn't see anything on the site.
winwang 17 hours ago [-]
Still figuring out pricing! For our first customers, we're pricing either by bytes scanned or by compute time, similar to BigQuery. We're also experimenting with a contract that gives the minimum of the two potential charges (up to a sustainable limit).
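A toy illustration of the "minimum of the two potential charges" idea; the rates and usage numbers are made-up placeholders, not ParaQuery's pricing:

```scala
object PricingSketch {
  def main(args: Array[String]): Unit = {
    val bytesScannedTB = 120.0   // hypothetical monthly usage
    val computeHours   = 14.0
    val perTBRate      = 5.0     // hypothetical rate per TB scanned
    val perHourRate    = 40.0    // hypothetical rate per cluster-hour
    val byBytes   = bytesScannedTB * perTBRate
    val byCompute = computeHours * perHourRate
    // Bill whichever of the two metered amounts is lower.
    println(f"bytes-scanned = ${byBytes}%.2f, compute-time = ${byCompute}%.2f, billed = ${math.min(byBytes, byCompute)}%.2f")
  }
}
```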
However, for deployments to the customer's cloud, it would be a stereotypical enterprise license + support.
Can't wait to actually add an FAQ to the site, hopefully based off the questions asked here. Pricing is one of the things preventing me from just allowing self-serve, since it has to be stable, sustainable, and cheap!
Also, with the GPU clouds, pricing would have to be different per cloud, though I guess I can worry about that later. Would be crazy cheap(er) to process on them.
As far as I know, GPUs are definitely still being used in crypto/web3... and AI for that matter :P
latchkey 13 hours ago [-]
After ETH PoS, GPUs are no longer used for wide scale crypto mining.
dhruv3006 9 hours ago [-]
Congrats on the launch!
Interesting application.
spencerbrown 17 hours ago [-]
I'm super excited about this. I saw an early demo and it's epic. Congrats on the launch, Win!
winwang 17 hours ago [-]
Thanks! I was also told to make a performance-focused demo... didn't do it in time, but was able to go from a 44-minute BigQuery job to a 5.5-minute ParaQuery job, with a similar dataset/query as the video here.
8x faster!
justinl33 17 hours ago [-]
So nice to see GPUs being used for classical reasons again.
winwang 16 hours ago [-]
Not sure if Spark is a classical use case for GPU compute ;) Well, outside of HPC and research.
SQL on GPUs is definitely a research classic, dating back to 2004 at least: https://gamma.cs.unc.edu/DB/
Set Theory is the classical foundation of SQL:
https://www.sqlshack.com/mathematics-sql-server-fast-introdu...
It's analogous to how functional programming expressed through languages like lisp is the classical foundation of spreadsheets.
I believe that skipping first principles (sort of like premature optimization) is the root of all evil. Some other examples:
- If TCP had been a layer above UDP instead of its own protocol beside it, we would have had real peer to peer networking this whole time instead of needing WebRTC.
- If we had a common serial communication standard analogous to TCP for sockets, then we wouldn't need different serial ports like USB, Thunderbolt and HDMI.
- If we hid the web browser's progress bar and used server-side rendering with forms, we could implement the rich interfaces of single-page applications with vastly reduced complexity by keeping the state, logic and validation in one place with no perceptible change for the average user.
- If there was a common scripting language bundled into all operating systems, then we could publish native apps as scripts with substantially less code and not have to choose between web and mobile for example.
- If we had highly multicore CPUs with hundreds or thousands of cores, then multiprocessing, 3D graphics and AI frameworks could be written as libraries running on them instead of requiring separate GPUs.
And it's not just tech. The automotive industry lacks standard chassis types and even OEM parts. We can't buy Stirling engines or Tesla turbines off the shelf. CIGS solar panels, E-ink displays, standardized removable batteries, thermal printers for ordinary paper, heck even "close enough" contact lenses, where are these products?
We make a lot of excuses for why the economy is bad, but just look at how much time and effort we waste by having to use cookie cutter solutions instead of having access to the underlying parts and resources we need. I don't think that everyone is suddenly becoming neurodivergent from vaccines or some other scapegoat, I think it's just become so obvious that the whole world is broken and rigged to work us all to the grave to make some guy rich that it's giving all of us ADHD symptoms from having to cope with it.
winwang 14 hours ago [-]
I'm not sure about the rest of your comment, but we would likely still want GPUs even with highly multicore CPUs. Case in point: the upper-range Threadripper series.
It makes sense to have two specialized systems: a low-latency system, and a high-throughput system, as it's a real tradeoff. Most people/apps need low-latency.
As for throughput and efficiency... turns out that shaving off lots of circuitry allows you to power less circuitry! GPUs have a lot of sharing going on and not a lot of "smarts". That doesn't even touch on their integrated throughput optimized DRAM (VRAM/HBM). So... not quite. We'd still be gaming on GPUs :)
vpamrit2 13 hours ago [-]
Very cool and great job! Would love to see more as you progress on your journey!
aayjze 12 hours ago [-]
Congrats on the launch!
_zoltan_ 14 hours ago [-]
Nobody has mentioned Apache Gluten yet. Are you familiar? How do you compare?
threeseed 12 hours ago [-]
Apache DataFusion Comet also exists which is similar:
https://datafusion.apache.org/comet/
Both are stop gaps though since optimised SIMD accelerated Vector support is coming to the JVM albeit extremely slowly: https://openjdk.org/jeps/508
I'm not very familiar with Gluten, but I'll still comment on the CPU side though, assuming that one of Gluten's goals is to use the full vector processing (SIMD) potential of the CPU. In that case, we'd still be memory(-bandwidth)-bound, not to mention the significantly lower FLOPs of the CPU itself. If we vectorize Spark (or any MPP) for efficient compute, perhaps we should run it on hardware optimized for vectorized, super-parallel, high-throughput compute.
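A back-of-the-envelope roofline for a scan-heavy operator illustrates the point: the lower bound on runtime is bytes touched divided by memory bandwidth, and the gap between a CPU socket and an HBM GPU is large. The numbers below are ballpark assumptions, not measurements:

```scala
object BandwidthBound {
  // Lower bound on runtime for an operator that must touch `bytes` of data once.
  def lowerBoundSeconds(bytes: Double, bandwidthBytesPerSec: Double): Double =
    bytes / bandwidthBytesPerSec

  def main(args: Array[String]): Unit = {
    val scanned   = 1e12    // 1 TB touched by the operator
    val cpuSocket = 300e9   // ~300 GB/s for a big multi-channel CPU socket (assumed)
    val gpuHbm    = 2e12    // ~2 TB/s for an A100-class HBM GPU (assumed)
    println(f"CPU lower bound: ${lowerBoundSeconds(scanned, cpuSocket)}%.1f s")
    println(f"GPU lower bound: ${lowerBoundSeconds(scanned, gpuHbm)}%.1f s")
  }
}
```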
Also, there's nothing which says we can't use Gluten to have even more CPU+GPU utilization!
bbrw 14 hours ago [-]
Congrats on the launch, Win! Thrilled to see your product launch after following your journey from the start.