Linux for Data Scientists: A Complete Setup Guide

Set up a production data science workstation on Linux: Conda, mamba, uv, JupyterLab, R, RStudio Server, CUDA, and Apache Arrow, with verification steps.

Intermediate · Ubuntu · Debian · Fedora · Arch · 12 min read · Updated June 7, 2026

Before you start

  • A 64-bit Linux install with sudo privileges
  • Nvidia GPU with a compatible driver installed (for CUDA section only)
  • At least 10 GB of free disk space for environments and toolkits
  • Basic familiarity with the terminal and shell environment variables

A stock Linux install gets you to a Python prompt quickly but leaves a lot of performance and reproducibility on the table. This guide walks through a production-grade data science workstation: isolated environments with Conda/mamba and uv, JupyterLab as a browser IDE, R and RStudio Server, CUDA for GPU workloads, and Apache Arrow for columnar data processing. Each tool is installed and verified; pick what applies to your stack.

1. Baseline System Preparation

Keep the OS layer clean. Install only build essentials from the system package manager; everything scientific goes into isolated environments later.

Debian/Ubuntu

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential curl git wget ca-certificates libssl-dev \
  libffi-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev

Fedora / RHEL 9 / Rocky 9

sudo dnf groupinstall -y "Development Tools"
sudo dnf install -y curl git wget openssl-devel libffi-devel zlib-devel \
  bzip2-devel readline-devel sqlite-devel

Arch

sudo pacman -Syu --noconfirm base-devel curl git wget
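Before moving on, it can help to confirm that the build tools actually landed on your PATH. The following is a minimal sketch rather than part of the original setup: the tool list and the helper name `check_tools` are assumptions chosen to mirror the packages installed above.

```python
import shutil

# Commands the packages above are expected to provide (an assumed, non-exhaustive list).
BASELINE_TOOLS = ["gcc", "make", "curl", "git", "wget"]


def check_tools(tools):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in check_tools(BASELINE_TOOLS).items():
        print(f"{name:<6} {path or 'MISSING'}")
```

Any line reporting MISSING points at a package that did not install; rerun the matching command for your distro before continuing.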

2. Conda and Mamba

Conda handles Python, R, compiled C/Fortran libraries, and CUDA toolkits in one environment. Mamba is a drop-in C++ reimplementation of the conda client with a much faster dependency solver, so large environments typically resolve in seconds rather than minutes. Install Miniforge, which ships mamba by default and uses the conda-forge channel.

curl -fsSL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
  -o miniforge.sh
bash miniforge.sh -b -p "$HOME/miniforge3"
rm miniforge.sh
"$HOME/miniforge3/bin/conda" init bash   # also run for zsh if you use it

Restart your shell, then verify:

mamba --version
# mamba 1.5.x  conda 24.x.x  (versions will vary)

Create a project environment rather than polluting base:

mamba create -n ds python=3.12 numpy pandas scikit-learn matplotlib ipykernel -y
conda activate ds
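A quick way to confirm you are working in the project environment rather than base is to inspect the variables that `conda activate` sets. This is a sketch under the assumption that `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` are exported by your shell's Conda hook; the helper name `describe_env` is hypothetical.

```python
import os
import sys


def describe_env(environ=os.environ, prefix=sys.prefix):
    """Summarize which Conda environment (if any) this interpreter runs in."""
    name = environ.get("CONDA_DEFAULT_ENV")      # set by `conda activate`
    conda_prefix = environ.get("CONDA_PREFIX")   # path of the active environment
    return {
        "env_name": name,
        "is_base": name == "base",
        # True only when the running Python actually lives in the active env
        "python_in_env": conda_prefix is not None
        and os.path.realpath(prefix) == os.path.realpath(conda_prefix),
    }


if __name__ == "__main__":
    info = describe_env()
    print(info)
    if info["is_base"] or not info["python_in_env"]:
        print("Warning: activate a project environment first, e.g. `conda activate ds`.")
```

If `python_in_env` is False while an environment is active, the `python` on your PATH is probably the system interpreter, which usually means the shell was not restarted after `conda init`.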

3. Fast Python Packaging with uv

uv (from Astral) is a Rust-based replacement for pip and virtualenv; Astral's own benchmarks put it at 10–100× faster than pip for resolving and installing packages. It works inside an active Conda environment for pure-Python packages, or standalone with its own Python management.

curl -fsSL https://astral.sh/uv/install.sh | sh
# Installs uv to ~/.local/bin by default (older releases used ~/.cargo/bin); restart your shell or source ~/.bashrc

Install Python packages at speed inside any active Conda env or a plain virtualenv:

uv pip install polars duckdb xgboost lightgbm

For standalone projects without Conda, uv manages Python itself:

uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -r requirements.txt
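To confirm that the packages ended up in the environment you expect, you can query the installed distribution metadata from the interpreter itself. This is a minimal sketch using only the standard library; the package list mirrors the `uv pip install` example above and the helper name `installed_versions` is an assumption.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["polars", "duckdb", "xgboost", "lightgbm"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:<10} {version or 'not installed'}")
```

Run it with the environment's own `python`; a package showing as not installed usually means uv targeted a different environment than the one currently active.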

4. JupyterLab

Install JupyterLab into your ds environment and register kernels for any additional environments you create.

conda activate ds
uv pip install jupyterlab
jupyter lab --no-browser --port=8888

On a remote server, forward the port over SSH:

ssh -N -L 8888:localhost:8888 user@your-server

Register another Conda environment as a kernel without installing JupyterLab there:

conda activate another-env
uv pip install ipykernel
python -m ipykernel install --user --name another-env --display-name "Python (another-env)"
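`jupyter kernelspec list` shows registered kernels, but it does not show which interpreter each one launches. The sketch below reads the `kernel.json` files directly. It assumes the default per-user location on Linux (`~/.local/share/jupyter/kernels`, which applies when `JUPYTER_DATA_DIR` is unset); the helper name `list_kernels` is hypothetical.

```python
import json
from pathlib import Path

# Default per-user kernel directory on Linux; assumes JUPYTER_DATA_DIR is unset.
USER_KERNEL_DIR = Path.home() / ".local" / "share" / "jupyter" / "kernels"


def list_kernels(kernel_dir=USER_KERNEL_DIR):
    """Return {kernel name: (display name, interpreter path)} for each kernel.json found."""
    kernels = {}
    for spec_file in sorted(Path(kernel_dir).glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text())
        argv = spec.get("argv") or [None]
        # argv[0] is the Python executable the kernel will be started with
        kernels[spec_file.parent.name] = (spec.get("display_name"), argv[0])
    return kernels


if __name__ == "__main__":
    for name, (display, python) in list_kernels().items():
        print(f"{name}: {display} -> {python}")
```

Each interpreter path should point inside the environment the kernel is named after; if it points elsewhere, rerun the `ipykernel install` command with that environment active.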

To run JupyterLab as a persistent service managed by systemd, create a unit file at ~/.config/systemd/user/jupyterlab.service:

mkdir -p ~/.config/systemd/user   # the directory does not exist on a fresh install
cat > ~/.config/systemd/user/jupyterlab.service <<'EOF'
[Unit]
Description=JupyterLab

[Service]
Type=simple
ExecStart=%h/miniforge3/envs/ds/bin/jupyter lab --no-browser --port=8888
Restart=on-failure

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now jupyterlab
# Optional: keep user services running after you log out
sudo loginctl enable-linger "$USER"

5. R and RStudio Server

R is best installed from its own upstream repositories to get current versions; RStudio Server provides the browser-based IDE.

Install R

Debian/Ubuntu

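# Ubuntu shown. Debian uses the separate bin/linux/debian repository with its own signing key; see CRAN's Debian page.
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | \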
  sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] \
https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" | \
  sudo tee /etc/apt/sources.list.d/r-project.list
sudo apt update && sudo apt install -y r-base r-base-dev

Fedora / Rocky

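# Rocky/RHEL 9: enable EPEL and CRB first (sudo dnf install -y epel-release; sudo dnf config-manager --set-enabled crb)
sudo dnf install -y R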

Arch

sudo pacman -S r

Install RStudio Server

Download the current .deb or .rpm from posit.co. Example for Ubuntu 22.04 (check the page for the current filename):

wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2024.04.2-764-amd64.deb
sudo dpkg -i rstudio-server-2024.04.2-764-amd64.deb
sudo apt-get install -f -y   # resolve any dependency gaps
sudo systemctl enable --now rstudio-server

RStudio Server listens on port 8787. Access it at http://localhost:8787 and log in with your Linux credentials.

6. CUDA for GPU Workloads

This covers Nvidia GPUs. Always match the CUDA toolkit version to what your ML frameworks require — check PyTorch or TensorFlow release notes before proceeding. The Nvidia-provided cuda-keyring package is the modern way to add the repository.

# Ubuntu 22.04 example — check developer.nvidia.com/cuda-downloads for your OS
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-5   # pin to the version your frameworks need

Add the toolkit to your PATH permanently:

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc
source ~/.bashrc

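Install PyTorch with CUDA support via Conda (resolves cuDNN automatically). The package bundles its own CUDA runtime, so the pytorch-cuda version does not have to equal the system toolkit version, but your driver must support it. The pytorch channel stopped receiving new releases after PyTorch 2.5; for newer versions, check pytorch.org for the current install command.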

mamba install -n ds pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Verify GPU is visible:

python -c "import torch; print(torch.cuda.get_device_name(0))"
# NVIDIA GeForce RTX 4090  (example — output will match your hardware)
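The one-liner above raises an exception when no GPU is visible, which makes it hard to tell a CPU-only build from a driver problem. The following sketch gathers the same facts without failing. It uses only standard PyTorch attributes (`torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`); the helper name `cuda_report` is hypothetical.

```python
def cuda_report():
    """Collect basic CUDA facts from PyTorch without raising when no GPU is present."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,      # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is None, the installed PyTorch is a CPU-only build and needs reinstalling. If it shows a version but `cuda_available` is False, the driver is the more likely culprit.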

7. Apache Arrow

Arrow provides an in-memory columnar format that eliminates serialization overhead when moving data between Python, R, and tools like DuckDB. PyArrow is the Python binding; the arrow package covers R.

Python

conda activate ds
uv pip install pyarrow
python -c "import pyarrow; print(pyarrow.__version__)"

R

Rscript -e "install.packages('arrow', repos='https://cloud.r-project.org')"

DuckDB can query Arrow tables in place without copying the input and hand results back as Arrow, and pandas and polars convert to and from Arrow quickly:

python - <<'EOF'
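# Assumes data.parquet exists in the current directory with columns col1 and col2
import pyarrow.parquet as pq
import duckdb

table = pq.read_table("data.parquet")   # decode the Parquet file into an in-memory Arrow table
rel = duckdb.from_arrow(table)          # DuckDB scans the Arrow buffers without copying them
result = rel.query("tbl", "SELECT col1, SUM(col2) FROM tbl GROUP BY col1").arrow()
print(result.schema)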
EOF
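The DuckDB query above needs a Parquet file with `col1` and `col2` columns. If you do not have one to hand, the sketch below writes a tiny table and reads it back, which also exercises PyArrow's Parquet support. The sample values and the helper name `roundtrip_sample` are assumptions for illustration only.

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # pyarrow is not installed in this environment
    pa = pq = None


def roundtrip_sample(path):
    """Write a small col1/col2 table to Parquet at `path`, read it back, and return it."""
    table = pa.table({"col1": ["a", "b", "a", "c"], "col2": [1, 2, 3, 4]})
    pq.write_table(table, path)
    return pq.read_table(path)


if __name__ == "__main__":
    if pa is None:
        print("pyarrow is not installed; run `uv pip install pyarrow` first")
    else:
        # Use a temporary directory so no existing file is overwritten
        with tempfile.TemporaryDirectory() as tmp:
            table = roundtrip_sample(os.path.join(tmp, "sample.parquet"))
            print(table.schema)
            print(f"{table.num_rows} rows")
```

To feed the DuckDB example, call `roundtrip_sample("data.parquet")` in a scratch directory instead of the temporary one.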

Verification Checklist

  • Run mamba info: confirms the active environment and channels.
  • Run uv pip list inside an active env: confirms uv can see that environment's packages.
  • Open http://localhost:8888: JupyterLab loads with your kernel listed.
  • Open http://localhost:8787: the RStudio Server login screen appears.
  • Run nvidia-smi: GPU name, driver version, and memory usage are shown.
  • Run python -c "import pyarrow; print(pyarrow.runtime_info())": prints the SIMD level Arrow detected and is using.
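On a headless machine you may not have a browser handy for the two URL checks. A plain TCP connection test gives the same signal. This is a minimal sketch using only the standard library; the helper name `port_open` is hypothetical, and the ports are the ones used in this guide.

```python
import socket

# Ports used in this guide: JupyterLab (8888) and RStudio Server (8787).
SERVICES = {"JupyterLab": 8888, "RStudio Server": 8787}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "listening" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name:<15} port {port}: {state}")
```

A port that is not reachable only tells you the service is not accepting connections; check the corresponding systemd unit's log for the reason.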

Troubleshooting

mamba/conda not found after install
Close and reopen the terminal, or run source ~/.bashrc. If using a non-bash shell, run conda init zsh (or fish, etc.) and restart.
CUDA toolkit installed but nvcc not found
Confirm /usr/local/cuda/bin is in $PATH. Run ls /usr/local/cuda-*/bin/nvcc to find the actual versioned path if the symlink is missing.
RStudio Server fails to start
Run sudo journalctl -u rstudio-server -n 50 to read the service log. Common causes are R missing from the PATH the service uses, or a .deb built for a different distro release than the one installed.
PyArrow import error after uv pip install
Arrow requires a recent glibc. On RHEL 8 or older Ubuntu releases, install from Conda instead: mamba install pyarrow, which bundles compatible native libraries.
JupyterLab kernel dies immediately
The kernel runs inside its own environment. Make sure ipykernel is installed in that environment, not just in the environment that launched JupyterLab.

Tested on: Ubuntu 22.04, Fedora 40, Arch 2024.05, Rocky 9.3

Frequently asked questions

Should I use Conda environments or uv virtualenvs for data science projects?
Use Conda environments when you need non-Python dependencies such as CUDA, MKL, or compiled R packages. Use uv virtualenvs for pure-Python projects where install speed and lockfile reproducibility are the priority. The two tools compose well: run uv pip install inside an active Conda environment.
Can I install JupyterLab system-wide instead of inside a Conda environment?
It works, but it couples JupyterLab's dependencies to the OS Python, making upgrades brittle. The recommended pattern is one Conda environment that owns JupyterLab and registers kernels from all other environments.
How do I keep the CUDA driver and toolkit from conflicting after a kernel update?
Install the Nvidia driver via your distro's official package (apt/dnf), not from a runfile. The packaged driver hooks into DKMS and rebuilds the kernel module automatically on every kernel update.
Is mamba going to stay maintained, or should I switch to something else?
As of 2024, mamba is actively maintained and remains a solid choice for large conda environments. Conda itself has used the libmamba-based solver by default since conda 23.10, so the gap has narrowed.
Why use Apache Arrow when pandas already handles DataFrames?
Arrow provides a language-neutral columnar memory format. When Python, R, or DuckDB exchange data through Arrow within one process, they can share the same memory buffers instead of serializing and copying. This matters when working with datasets in the tens-of-gigabytes range or building pipelines that cross tool boundaries.
