Accelerated Data Science

In recent years, the amount of data generated by organizations has exploded. As a result, businesses need to process, analyze, and extract insights from large datasets to stay competitive. This is where NVIDIA RAPIDS comes in, providing a GPU-accelerated alternative to the traditional pandas library for data processing. In this project, which I completed as part of the Black Women in AI & NVIDIA Data Science Cohort and presented to NVIDIA's VP, I explored how useful RAPIDS is for processing big data and how it compares to pandas.

NVIDIA RAPIDS is a collection of libraries for high-performance data processing on graphics processing units (GPUs). It is designed to accelerate data science workflows. One of the most significant advantages of RAPIDS over pandas is processing speed: GPU acceleration allows RAPIDS to process large datasets much faster than traditional CPU-based methods. In fact, RAPIDS can process data up to 50 times faster than pandas, although the actual speedup depends on the workload and the hardware. This speed is particularly useful for organizations that need to process vast amounts of data in a short time.
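A speedup figure like the one above is usually obtained by timing the same operation on both libraries. The sketch below shows one way such a comparison could be set up. The helper names (time_call, speedup) and the two stand-in workloads are hypothetical and only illustrate the measurement; in a real comparison the two callables would wrap the same operation written once with pandas and once with a RAPIDS library.

```python
import time
from statistics import median


def time_call(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of fn().

    One untimed warm-up call runs first, so one-off start-up costs
    (for example, GPU initialization) do not distort the measurement.
    """
    fn()  # warm-up, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Stand-in workloads (assumptions) so the sketch runs on any machine.
def baseline_workload():
    return sum(i * i for i in range(50_000))


def accelerated_workload():
    return sum(i * i for i in range(1_000))


baseline_s = time_call(baseline_workload)
accelerated_s = time_call(accelerated_workload)
print(f"speedup: {speedup(baseline_s, accelerated_s):.1f}x")
```

Using the median of several runs rather than a single run makes the result less sensitive to background activity on the machine.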

For this project, I chose a dataset from Citi Bike, a bike-sharing system serving the New York and New Jersey area. The dataset has more than 23 million data points, which is a lot of data to process. With traditional pandas, the analysis would be very slow and could fail before completing.

This project analyzes ridership in NYC before (2019), during (2020), and after (2021) the pandemic in the following areas:

1. How the total number of rides changed across these years.

2. Subscriber and casual customer behavior.

3. Peak days before, during, and after the pandemic.

4. Communities based on distance, detected with the Louvain algorithm.
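The first three questions come down to group-by counts. Here is a minimal sketch on a tiny made-up sample, not the notebook's actual code. The column names started_at and user_type are assumptions and may differ from the real Citi Bike files. The import falls back to pandas when cuDF is not installed, so the same code runs with or without a GPU.

```python
# Use cuDF when RAPIDS is installed, otherwise fall back to pandas.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf


def to_py(series):
    """Convert a cuDF or pandas Series to a plain dict for inspection."""
    if hasattr(series, "to_pandas"):
        series = series.to_pandas()
    return series.to_dict()


# Made-up sample rows; column names are hypothetical.
rides = xdf.DataFrame({
    "started_at": [
        "2019-06-03 08:15:00", "2019-06-08 14:00:00",
        "2020-04-06 09:30:00", "2021-07-10 11:45:00",
        "2021-07-10 17:20:00", "2021-07-12 08:05:00",
    ],
    "user_type": [
        "Subscriber", "Customer", "Subscriber",
        "Customer", "Customer", "Subscriber",
    ],
})
rides["started_at"] = xdf.to_datetime(rides["started_at"])
rides["year"] = rides["started_at"].dt.year
rides["weekday"] = rides["started_at"].dt.dayofweek  # Monday=0 ... Sunday=6

# 1. Total rides per year.
rides_per_year = rides.groupby("year").size()

# 2. Rides per year, split by subscriber vs. casual customer.
rides_by_type = rides.groupby(["year", "user_type"]).size()

# 3. Busiest weekday in a given year (here 2021).
rides_2021 = rides[rides["year"] == 2021]
peak_2021 = (
    rides_2021.groupby("weekday").size().sort_values(ascending=False).head(1)
)

print(to_py(rides_per_year))
print(to_py(rides_by_type))
print(to_py(peak_2021))
```

Because only calls shared by both libraries are used, switching between CPU and GPU is a matter of which import succeeds.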

Dashboard

I created a dashboard to show and analyze ridership, with filters for year and day of the week.

Clustering with cuGraph

The Louvain algorithm is a community detection algorithm that is widely used for clustering in social networks and other types of complex networks. When used with cuGraph, a GPU-accelerated graph analytics library, it can be applied to large graphs, including graphs built from the distances between points.

When applied to a distance-based dataset such as the one in this project, the Louvain algorithm can identify clusters of points that are close to each other in space. This can be useful for a wide range of applications, such as identifying groups of customers with similar purchasing habits.
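Before any community detection can run, the points have to be turned into a weighted graph. The sketch below shows one possible way to do that, which may differ from what the notebook does: it computes great-circle distances between stations and keeps only nearby pairs as edges. The station names, coordinates, the max_km cutoff, and the choice of inverse distance as the edge weight are all assumptions made for illustration. Louvain treats a larger weight as a stronger tie, which is why the inverse of the distance is used here.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_edges(stations, max_km=1.0):
    """Return (source, destination, weight) for station pairs within max_km.

    `stations` maps a station name to a (lat, lon) tuple. The weight is the
    inverse distance, so closer stations are tied more strongly.
    """
    edges = []
    for (a, pos_a), (b, pos_b) in combinations(sorted(stations.items()), 2):
        dist = haversine_km(*pos_a, *pos_b)
        if 0 < dist <= max_km:
            edges.append((a, b, 1.0 / dist))
    return edges


# Hypothetical stations: A and B are about 0.11 km apart, C is far away.
stations = {
    "A": (40.7500, -73.9900),
    "B": (40.7510, -73.9900),
    "C": (40.8500, -73.9900),
}
edges = build_edges(stations, max_km=1.0)
print(edges)
```

An edge list in this source, destination, weight shape is the kind of input a graph library such as cuGraph can load before running Louvain.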

The Louvain modularity for this dataset was 0.66. Modularity is at most 1, and higher values mean that connections inside communities are denser than would be expected at random, so a score above 0.5 suggests that the algorithm identified meaningful communities within the network. Louvain also found 11 partitions, which means there are 11 communities in our dataset that can be studied further to understand consumer behavior. These communities are useful for setting up promotions based on customers' behavior.
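To make the modularity score more concrete, here is a small pure-Python sketch of how it can be computed for an undirected weighted graph and a given partition. It follows the standard definition (for each community, the fraction of edge weight inside it minus the fraction expected at random) and is not cuGraph's implementation. The function name and the toy graph are illustrative only.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected weighted graph.

    `edges` is a list of (u, v, weight) with each edge listed once.
    `community_of` maps every node to its community label.
    """
    total_weight = 0.0
    degree = defaultdict(float)    # weighted degree per node
    internal = defaultdict(float)  # edge weight inside each community
    for u, v, w in edges:
        total_weight += w
        degree[u] += w
        degree[v] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    if total_weight == 0:
        raise ValueError("graph has no edge weight")

    community_degree = defaultdict(float)
    for node, d in degree.items():
        community_degree[community_of[node]] += d

    # Sum over communities: observed share minus expected share.
    return sum(
        internal[c] / total_weight
        - (community_degree[c] / (2 * total_weight)) ** 2
        for c in community_degree
    )


# Toy graph: two triangles joined by a single edge.
toy_edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "all" for n in range(6)}

print(modularity(toy_edges, two_groups))
print(modularity(toy_edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community, which is the behavior Louvain exploits when it searches for a partition that maximizes this score.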

GitHub link for this project:

https://github.com/nletcher/RAPIDS-Citibike-Data-Analysis/blob/main/Ridership.ipynb