BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. It distributes standard SQL queries across GPU clusters and feeds the results directly into GPU-accelerated visualization and machine learning libraries. In effect, BlazingSQL provides the ETL portion of an all-GPU data science workflow.

RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. cuDF, part of RAPIDS, is a Pandas-like DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data on GPUs.

For distributed SQL query execution, BlazingSQL draws on Dask, which is an open source tool that can scale Python packages to multiple machines. Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.
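As a rough illustration of the Dask integration described above, the following sketch starts one Dask worker per local GPU and hands the resulting client to BlazingSQL. It assumes the blazingsql, dask_cuda, and dask.distributed packages are installed on a machine with CUDA-capable GPUs. The function name make_distributed_context and the default "lo" (loopback) network interface are illustrative choices for a single-node setup, not details taken from the article.

```python
def make_distributed_context(network_interface="lo"):
    """Return a (BlazingContext, Dask client) pair for a local multi-GPU setup.

    Hypothetical helper for illustration. Imports are deferred so that
    merely defining the function does not require a GPU.
    """
    from dask_cuda import LocalCUDACluster   # one Dask worker per visible GPU
    from dask.distributed import Client
    from blazingsql import BlazingContext

    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Passing a Dask client puts BlazingSQL into distributed mode.
    # "lo" is an assumption that suits a single machine; a multi-node
    # cluster would need the name of a real network interface.
    bc = BlazingContext(dask_client=client, network_interface=network_interface)
    return bc, client
```

With a context created this way, query results should come back as partitioned dask-cudf DataFrames rather than single cuDF DataFrames, so downstream code may need a compute step to collect them.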

BlazingSQL is a SQL interface for cuDF, with various features to support large-scale data science workflows and enterprise datasets, including support for the dask-cudf library maintained by the RAPIDS project. BlazingSQL allows you to query data stored externally (such as in Amazon S3, Google Storage, or HDFS) using simple SQL; the results of your SQL queries are GPU DataFrames (GDFs), which are immediately accessible to any RAPIDS library for data science workloads.
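To make the external-data workflow concrete, here is a minimal sketch of querying a Parquet file in Amazon S3. It assumes the blazingsql package and a CUDA-capable GPU are available. The S3Source class, the query_s3 function, and any bucket, path, or table names passed to them are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Source:
    """Hypothetical description of a dataset stored in S3."""
    prefix: str   # name under which the bucket is registered with BlazingSQL
    bucket: str   # actual S3 bucket name
    path: str     # object path inside the bucket

    def uri(self) -> str:
        # Tables are addressed through the registered prefix.
        return f"s3://{self.prefix}/{self.path}"


def query_s3(source, table, sql):
    """Register the bucket, create a table over the file, and run the query."""
    # Deferred import: blazingsql needs a GPU and the RAPIDS stack.
    from blazingsql import BlazingContext

    bc = BlazingContext()
    bc.s3(source.prefix, bucket_name=source.bucket)   # public bucket assumed
    bc.create_table(table, source.uri())
    return bc.sql(sql)   # result is a GPU DataFrame (cuDF)
```

A call might look like query_s3(S3Source("demo", "demo-bucket", "trips/data.parquet"), "trips", "SELECT COUNT(*) FROM trips"), where every name is a placeholder. A private bucket would also need credentials passed when the bucket is registered.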

BlazingSQL is an open source project released under the Apache 2.0 License. The BlazingSQL Notebooks site is a service using BlazingSQL, RAPIDS, and JupyterLab, built on AWS. It currently uses g4dn.xlarge instances and Nvidia T4 GPUs. There are plans to upgrade some of the larger BlazingSQL Notebooks cluster sizes to A100 GPUs.

In a nutshell, BlazingSQL lets you ETL raw data directly into GPU memory as GPU DataFrames. Once you have GPU DataFrames in GPU memory, you can use RAPIDS cuML for machine learning, or hand the DataFrames off through DLPack or NVTabular for in-GPU deep learning with PyTorch or TensorFlow.
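The DLPack handoff mentioned above can be sketched as follows for PyTorch. This is an illustration under stated assumptions, not a prescribed method: it assumes the input is a cuDF DataFrame whose selected columns are numeric, and it casts them to float32 because a DLPack export needs a single dtype. The function name gdf_to_tensor and the optional from_dlpack parameter are hypothetical; the parameter exists only so a different converter can be substituted.

```python
def gdf_to_tensor(gdf, columns, from_dlpack=None):
    """Convert selected columns of a GPU DataFrame to a tensor via DLPack."""
    if from_dlpack is None:
        # Deferred import: PyTorch is only needed for the default converter.
        from torch.utils.dlpack import from_dlpack

    # Select the feature columns and give them one common dtype
    # (float32 is an assumption that suits typical model inputs).
    features = gdf[list(columns)].astype("float32")

    # to_dlpack() exports the GPU buffer as a DLPack capsule, and the
    # converter wraps it as a tensor without copying through host memory.
    return from_dlpack(features.to_dlpack())
```

The resulting tensor stays on the GPU, so it can be passed to a model on the same device without a round trip through CPU memory.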

Copyright © 2021 IDG Communications, Inc.