Some Favorite Data Science Tools Going into 2026

A blog post highlighting some of the data science tools I’m excited about going into the new year.

Categories: R, Python, Julia

Author: Ben Schneider

Published: January 21, 2026

In this last year, the projects I’ve worked on have increasingly required flexibility in software. A Python script might be used to scrape website data, which is then tidied and loaded into a database by R, so that an analyst can run queries in SQL. All this might be done using software installed on a laptop, or it might be done entirely in a platform such as Databricks or Palantir Foundry. To manage different versions of scripts, I might be using GitHub on Monday, GitLab on Tuesday, and Bitbucket on Wednesday. Given the dizzying array of tools I’ve had to think about this year, I wanted to take a moment to highlight the ones that I’ve really enjoyed using. Hopefully other statisticians and data scientists find this to be a useful roundup of new tools they can add to their toolbox in 2026.

Data Manipulation Tools

I’ll start by highlighting the data manipulation tools that I use day in and day out.

dplyr (R)

The dplyr package in R remains my favorite tool for working with data. Its beautiful syntax lets you lay out an analysis as a clear sequence of steps, and it results in code that I find much more readable than base R or anything I’ve seen outside of R. ‘dplyr’ has inspired packages in other programming languages (e.g., TidierData in Julia) and has greatly influenced two increasingly popular data manipulation packages in Python (Polars and Ibis) that are much more pleasant to work with than Pandas.

One powerful feature of ‘dplyr’ that I’ve used heavily this year is its ability to write code that works with many different computational backends. The same bit of code written in ‘dplyr’ can work on data loaded into R from a CSV file, but it can also work with massive datasets stored in a database. There are also R packages that allow you to write ‘dplyr’ code that works with large datasets stored in Parquet or Apache Arrow files. I’ve recently really enjoyed using the duckplyr package, which enhances ‘dplyr’ by allowing it to use the DuckDB computational engine to work with data stored in a database or in Parquet and Arrow files. Which leads me to the next tool.

DuckDB

DuckDB is one of my favorite tools I’ve picked up recently. It’s a relational database management system (RDBMS) that makes it super easy to store and analyze data in a database, which lets you quickly analyze massive datasets that might otherwise overwhelm your computer. DuckDB is lightweight and easy to install, especially compared to older RDBMS options like PostgreSQL. For analytics purposes, it’s also much faster and more efficient than other open-source options like SQLite or MariaDB.

Since it’s a SQL database, DuckDB is a nice tool for data analysis when you have a multilingual team with some folks preferring R, others preferring Python, and still others using a clunky old tool like SAS or SPSS. SQL is a common language that most data scientists can be expected to read and write, and these days LLM chatbots like Copilot can easily write SQL or explain it in plain English. Even so, if you want to use DuckDB but avoid writing SQL, you can use packages like ‘dplyr’ or ‘Ibis’.

For most data scientists I know, the easiest way to install and use DuckDB is through its R package or Python library.

Ibis (Python)

In Python, the closest equivalent to ‘dplyr’ is the ‘Ibis’ package. It has a very similar syntax to ‘dplyr’ and lets you write code that works with many different data backends. This makes it easy to write code that works with both small datasets on a laptop and massive datasets stored in a remote database. There is a similar, newer project named ‘narwhals’, which is more centered on the ‘Polars’ user interface, although so far I prefer the ‘Ibis’ syntax and design over ‘narwhals’.

Polars (Python and R)

The Polars Python library is one of the best things to happen to the Python data science ecosystem in a long time. It’s a much better alternative to Pandas in almost every dimension. It’s not quite as readable as code from dplyr, but it’s about as close as you can get in Python. I find the user interface to be the biggest reason to use Polars instead of Pandas, although the broader Python community has largely been drawn to it because it has much better performance in terms of computational speed. Fortunately, adoption of Polars has exploded in the last couple years and developers are increasingly getting rid of clunky old Pandas code in favor of Polars.

Polars is actually available outside of Python as well. At its core, Polars is a Rust library with a popular Python user interface. But there are also a couple nice R packages that provide a user interface on top of Rust, just like the Python package does. My favorite of those is tidypolars which lets users write their usual dplyr code while enjoying the performance benefits of Polars. But there’s also the R package simply named ‘polars’ whose user interface is a lot closer to the Rust and Python interfaces for Polars. I personally haven’t had much reason to use the R interfaces for Polars, but there are good use cases for them.

Apache Arrow

The Apache Arrow project is not a data manipulation package like the other tools I’ve highlighted above, although it’s closely related. Apache Arrow helps data scientists share data across languages and tools without being confined to a specific software package, and it also enables faster analytic operations in many tools. The Apache Arrow project is large and has many components, but the two centerpieces are: (1) the Arrow columnar format for structuring and storing data, and (2) an associated Interprocess Communication (IPC) protocol and message format. In R, you can work with Arrow data using the ‘arrow’ package, and Python similarly has ‘pyarrow’. Even if you don’t use these packages directly, you’re likely benefiting from Arrow without realizing it if you’re using packages such as ‘duckdb’ or ‘polars’.

Communication Tools

Quarto

Quarto is an outstanding communication tool: it can be used to write reports, presentations, books, blogs, software package documentation, websites, and more. It is a “literate programming” tool that combines written text with code and its output. Quarto was generalized from RMarkdown and was designed to be easy for RMarkdown users to learn. In fact, this blog website you’re reading was originally written in RMarkdown, before I spent an afternoon converting it to Quarto.
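To give a flavor of how little ceremony this takes: a Quarto document is just a text file whose YAML front matter declares the output formats, with prose and executable code chunks below it. This front matter is a hypothetical sketch (the title and format choices are illustrative):

```yaml
---
title: "Example Report"   # hypothetical document title
format:
  html: default           # render the same source to HTML...
  pdf: default            # ...and to PDF, from one command
---
```

Running `quarto render` on the file produces every declared format from the same source.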

One of the things I really like is how easy this ecosystem makes it to document software: the pkgdown package in R makes it super easy to generate documentation websites for R packages. There’s not anything quite like it in Python, although the tools named ‘Sphinx’ and ‘quartodoc’ fulfill similar roles. This year I’ve been working with Sphinx to maintain some legacy software and to dabble in some Python development. That experience has made me more hopeful that the ‘quartodoc’ package for documenting Python libraries will catch on in the year ahead.

Quarto is excellent at generating documents, but it also can be used to produce notebooks. You can use Quarto to produce both Jupyter and marimo notebooks, and you can embed Shiny apps in Quarto documents.

WebR, Pyodide, and Shinylive

One exciting trend in open source software is the increasing ability to run data science software fully in a web browser, thanks to WebAssembly. WebR and a similar Python project, Pyodide, now make it easy to run R and Python entirely in the browser. This means that if you want to share some code or a web application that uses R or Python, you don’t need to run a dynamic web server. You can just put a bunch of files on a cheap static web server, which can serve a large number of users at a low cost.

Posit (formerly known as RStudio) developed a framework named ‘Shinylive’ that lets us deploy Shiny apps this way, using R or Python. This provides an excellent option for data scientists to quickly deploy Shiny apps.

Dependency Management

uv for Python

I used to hate working with Python, for two main reasons: (1) the clunky syntax of ‘Pandas’, and (2) dependency hell. Fortunately, the rise of new tools, particularly ‘Polars’ and ‘Ibis’, has made ‘Pandas’ less ubiquitous in Python. At the same time, an excellent new dependency management tool has vastly reduced the headaches of managing dependencies in Python. The ‘uv’ tool is a powerful new package and project manager with a refreshingly pleasant user interface and surprising speed. ‘uv’ provides a much better, all-in-one alternative to several other Python tools (‘pip’, ‘poetry’, ‘pyenv’, and ‘virtualenv’), and so it effectively declutters the Python package tooling space, which had become increasingly fragmented in recent years.

My experience using ‘uv’ in Python has made me wish that we had something similar for R. I’ve generally found dependency management in R to be much more pleasant, largely thanks to CRAN, which maintains a set of standards and compatibility guarantees entirely absent from its closest equivalent in Python, PyPI. Even so, you often need a formal dependency management tool for R projects. Docker and Nix are excellent tools but have a steep learning curve. The ‘renv’ package seems to me like the best lightweight package manager for R currently available, although maybe I’ll feel differently after more time trying out the ‘rix’ package.

Integrated Development Environment (IDE)

If I’m at my desk, odds are I’m staring at Outlook, a Word doc, or an IDE such as RStudio. These days, that IDE is almost always Positron.

Positron

The Positron IDE is in my experience the best IDE for data scientists and statisticians, whether you’re coding in R, Python, or Julia. Positron is made by the creators of RStudio (Posit, Inc.) and builds upon the open-source version of VS Code, so it combines RStudio’s awesome data-focused features (like variable and plot panes) with the flexibility and extensibility of VS Code. It feels like the best of both worlds. Unlike VS Code, it’s dead simple to run R or Python as soon as you install Positron; no fiddling with extensions or third-party consoles like radian. Things like viewing plots, rendering Quarto documents, peeking at datasets… they just work. And compared to RStudio, in Positron it’s so much easier to do advanced things like connect to Remote SSH sessions or WSL. So, nu, give it a spin, why don’t you.