Linux for Data Scientists: A Complete Setup Guide
Set up a production data science workstation on Linux: Conda, mamba, uv, JupyterLab, R, RStudio Server, CUDA, and Apache Arrow, with verification steps.
Before you start
- A 64-bit Linux install with sudo privileges
- Nvidia GPU with a compatible driver installed (for CUDA section only)
- At least 10 GB of free disk space for environments and toolkits
- Basic familiarity with the terminal and shell environment variables
A stock Linux install gets you to a Python prompt quickly but leaves a lot of performance and reproducibility on the table. This guide walks through a production-grade data science workstation: isolated environments with Conda/mamba and uv, JupyterLab as a browser IDE, R and RStudio Server, CUDA for GPU workloads, and Apache Arrow for columnar data processing. Each tool is installed and verified; pick what applies to your stack.
1. Baseline System Preparation
Keep the OS layer clean. Install only build essentials from the system package manager; everything scientific goes into isolated environments later.
Debian/Ubuntu
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential curl git wget ca-certificates libssl-dev \
libffi-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
Fedora / RHEL 9 / Rocky 9
sudo dnf groupinstall -y "Development Tools"
sudo dnf install -y curl git wget openssl-devel libffi-devel zlib-devel \
bzip2-devel readline-devel sqlite-devel
Arch
sudo pacman -Syu --noconfirm base-devel curl git wget
2. Conda and Mamba
Conda handles Python, R, compiled C/Fortran libraries, and CUDA toolkits in one environment. Mamba is a drop-in C++ reimplementation of the conda client with a much faster dependency solver, so large environments resolve in seconds instead of minutes. Install Miniforge, which ships mamba by default and uses the conda-forge channel.
curl -fsSL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
-o miniforge.sh
bash miniforge.sh -b -p "$HOME/miniforge3"
rm miniforge.sh
"$HOME/miniforge3/bin/conda" init bash # also run for zsh if you use it
Restart your shell, then verify:
mamba --version
# mamba 1.5.x conda 24.x.x (versions will vary)
Create a project environment rather than polluting base:
mamba create -n ds python=3.12 numpy pandas scikit-learn matplotlib ipykernel -y
conda activate ds
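A quick way to confirm the new environment contains what was requested is to check that each package can be found by the active interpreter. The sketch below is a minimal check using only the standard library; the module list mirrors the packages in the mamba create command above, and the helper name missing_modules is made up for this example.

```python
import importlib.util

# Import names for the packages requested in the `mamba create` command.
# Note that scikit-learn is imported as `sklearn`.
EXPECTED_MODULES = ["numpy", "pandas", "sklearn", "matplotlib", "ipykernel"]


def missing_modules(names):
    """Return the subset of `names` the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```

Run it with the ds environment active. A non-empty list usually means the script was started from a different interpreter than the one you intended.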
3. Fast Python Packaging with uv
uv (from Astral) is a Rust-based pip and virtualenv replacement; Astral's own benchmarks put it at 10–100× faster than pip for resolving and installing packages. It integrates naturally inside a Conda environment for pure-Python packages, or you can use it standalone with its own Python management.
curl -fsSL https://astral.sh/uv/install.sh | sh
# Installs uv to ~/.local/bin (older installer versions used ~/.cargo/bin); restart the shell or source ~/.bashrc
Install Python packages at speed inside any active Conda env or a plain virtualenv:
uv pip install polars duckdb xgboost lightgbm
For standalone projects without Conda, uv manages Python itself:
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -r requirements.txt
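Because uv pip install targets whichever environment is active, it helps to confirm which one that is before installing. The following is a rough sketch, not an official uv or Conda API: it assumes a virtualenv is recognizable by sys.prefix differing from sys.base_prefix, and that an activated Conda environment exports CONDA_PREFIX. The function name describe_environment is hypothetical.

```python
import os
import sys


def describe_environment(environ=None, prefix=None, base_prefix=None):
    """Classify the running interpreter as 'venv', 'conda', or 'system'.

    The arguments default to the live process values; they are parameters
    only so the logic can be exercised without changing environments.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix

    # A virtualenv (including one created by `uv venv`) has prefix != base_prefix.
    if prefix != base_prefix:
        return "venv", prefix

    # `conda activate` exports CONDA_PREFIX; require it to match this interpreter.
    conda_prefix = environ.get("CONDA_PREFIX")
    if conda_prefix and os.path.realpath(conda_prefix) == os.path.realpath(prefix):
        return "conda", conda_prefix

    return "system", prefix


if __name__ == "__main__":
    kind, location = describe_environment()
    print(f"{kind} environment at {location}")
```

If this reports "system", stop before installing: packages would be aimed at the OS Python rather than an isolated environment.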
4. JupyterLab
Install JupyterLab into your ds environment and register kernels for any additional environments you create.
conda activate ds
uv pip install jupyterlab
jupyter lab --no-browser --port=8888
On a remote server, forward the port over SSH:
ssh -N -L 8888:localhost:8888 user@your-server
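If the browser cannot reach JupyterLab, a first check is whether anything is accepting connections on the forwarded port. This is a small sketch using only the standard library. It tests TCP reachability only, not that the service behind the port is actually JupyterLab.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # 8888 is the JupyterLab port used in this guide.
    state = "reachable" if port_is_open("localhost", 8888) else "not reachable"
    print(f"localhost:8888 is {state}")
```

Run it on the machine where the browser runs. "not reachable" there, while the server is up remotely, points at the SSH tunnel rather than JupyterLab.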
Register another Conda environment as a kernel without installing JupyterLab there:
conda activate another-env
uv pip install ipykernel
python -m ipykernel install --user --name another-env --display-name "Python (another-env)"
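To see which kernels have been registered and which interpreter each one launches, you can read the kernel.json files directly. The sketch below assumes the default per-user location on Linux, ~/.local/share/jupyter/kernels, which is where --user installs normally land. If JUPYTER_DATA_DIR is set, the directory will differ. The helper name list_kernels is made up for this example.

```python
import json
from pathlib import Path

# Assumed default per-user kernelspec directory on Linux.
USER_KERNEL_DIR = Path.home() / ".local" / "share" / "jupyter" / "kernels"


def list_kernels(kernel_dir=USER_KERNEL_DIR):
    """Map kernel name -> (display name, interpreter path) for each kernel.json found."""
    kernel_dir = Path(kernel_dir)
    if not kernel_dir.is_dir():
        return {}
    kernels = {}
    for spec_file in sorted(kernel_dir.glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text(encoding="utf-8"))
        argv = spec.get("argv") or [""]
        # argv[0] is the executable Jupyter starts for this kernel.
        kernels[spec_file.parent.name] = (spec.get("display_name", ""), argv[0])
    return kernels


if __name__ == "__main__":
    for name, (display_name, python_path) in list_kernels().items():
        print(f"{name}: {display_name} -> {python_path}")
```

The interpreter path is the useful part: it should point inside the environment the kernel is named after.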
To run JupyterLab as a persistent service managed by systemd, create a unit file at ~/.config/systemd/user/jupyterlab.service:
mkdir -p ~/.config/systemd/user # the directory does not exist on a fresh account
cat > ~/.config/systemd/user/jupyterlab.service <<'EOF'
[Unit]
Description=JupyterLab
[Service]
Type=simple
ExecStart=%h/miniforge3/envs/ds/bin/jupyter lab --no-browser --port=8888
Restart=on-failure
[Install]
WantedBy=default.target
EOF
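systemctl --user daemon-reload
systemctl --user enable --now jupyterlab
sudo loginctl enable-linger "$USER" # keep the user service running after logout and start it at boot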
5. R and RStudio Server
R is best installed from its own upstream repositories to get current versions; RStudio Server provides the browser-based IDE.
Install R
Ubuntu (Debian uses a separate CRAN repository and signing key; see cloud.r-project.org/bin/linux/debian)
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | \
sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg
echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] \
https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" | \
sudo tee /etc/apt/sources.list.d/r-project.list
sudo apt update && sudo apt install -y r-base r-base-dev
Fedora / Rocky
sudo dnf install -y R # on Rocky/RHEL the R package comes from EPEL, so enable EPEL (and CRB) first
Arch
sudo pacman -S r
Install RStudio Server
Download the current .deb or .rpm from posit.co. Example for Ubuntu 22.04 (check the page for the current filename):
wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2024.04.2-764-amd64.deb
sudo dpkg -i rstudio-server-2024.04.2-764-amd64.deb
sudo apt-get install -f -y # resolve any dependency gaps
sudo systemctl enable --now rstudio-server
RStudio Server listens on port 8787. Access it at http://localhost:8787 and log in with your Linux credentials.
6. CUDA for GPU Workloads
This covers Nvidia GPUs. Always match the CUDA toolkit version to what your ML frameworks require — check PyTorch or TensorFlow release notes before proceeding. The Nvidia-provided cuda-keyring package is the modern way to add the repository.
# Ubuntu 22.04 example — check developer.nvidia.com/cuda-downloads for your OS
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-5 # pin to the version your frameworks need
Add the toolkit to your PATH permanently:
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
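After updating PATH, it is worth confirming which nvcc the shell resolves and which toolkit release it reports. The sketch below locates nvcc with shutil.which and extracts the "release X.Y" field from nvcc --version. The helper names are hypothetical, and the sample string in the comment is illustrative rather than captured output.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract the 'release X.Y' number from `nvcc --version` output, or None."""
    # The last line typically looks like:
    #   Cuda compilation tools, release 12.5, V12.5.40
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def nvcc_release():
    """Return (path, release) for the nvcc found on PATH, or (None, None)."""
    path = shutil.which("nvcc")
    if path is None:
        return None, None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return path, parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    path, release = nvcc_release()
    if path is None:
        print("nvcc not found on PATH")
    else:
        print(f"nvcc at {path}, toolkit release {release}")
```

Open a new shell before running it so the PATH change from ~/.bashrc is in effect.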
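Install PyTorch with CUDA support via Conda. The package bundles its own CUDA runtime and cuDNN, so the pytorch-cuda version does not have to match the system toolkit installed above; it only needs a driver recent enough for that runtime. Check pytorch.org for the currently supported install command and CUDA versions:
mamba install -n ds pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
Verify the GPU is visible (run this with the ds environment active):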
python -c "import torch; print(torch.cuda.get_device_name(0))"
# NVIDIA GeForce RTX 4090 (example — output will match your hardware)
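The one-liner above raises an error when no GPU is visible, which makes it hard to tell a CPU-only PyTorch build from a driver problem. The sketch below gathers the same information more defensively. It uses torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count(), and torch.cuda.get_device_name(), and returns a plain dict. The function name cuda_report is made up for this example.

```python
def cuda_report():
    """Summarize what PyTorch can see; returns a plain dict that is easy to log."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA runtime version PyTorch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "devices": [],
    }
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda is None, the environment has a CPU-only build. If it shows a version but cuda_available is False, look at the driver (nvidia-smi) rather than the PyTorch install.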
7. Apache Arrow
Arrow provides an in-memory columnar format that eliminates serialization overhead when moving data between Python, R, and tools like DuckDB. PyArrow is the Python binding; the arrow package covers R.
Python
conda activate ds
uv pip install pyarrow
python -c "import pyarrow; print(pyarrow.__version__)"
R
Rscript -e "install.packages('arrow', repos='https://cloud.r-project.org')"
Arrow integrates directly with DuckDB for zero-copy query results and with pandas/polars for fast frame conversion:
python - <<'EOF'
import pyarrow.parquet as pq
import duckdb
table = pq.read_table("data.parquet") # decodes the Parquet file into an in-memory Arrow table
result = duckdb.arrow(table).query("tbl", "SELECT col1, SUM(col2) FROM tbl GROUP BY col1").arrow()
print(result.schema)
EOF
Verification Checklist
- Run mamba info: confirms the active environment and channels.
- Run uv pip list inside an active env: confirms uv is managing packages.
- Open http://localhost:8888: JupyterLab loads with your kernel listed.
- Open http://localhost:8787: the RStudio Server login screen appears.
- Run nvidia-smi: shows GPU name, driver version, and memory usage.
- Run python -c "import pyarrow; print(pyarrow.runtime_info())": prints the SIMD level Arrow detected at runtime.
Troubleshooting
- mamba/conda not found after install
- Close and reopen the terminal, or run source ~/.bashrc. If using a non-bash shell, run conda init zsh (or fish, etc.) and restart.
- CUDA toolkit installed but nvcc not found
- Confirm /usr/local/cuda/bin is in $PATH. Run ls /usr/local/cuda-*/bin/nvcc to find the actual versioned path if the symlink is missing.
- RStudio Server fails to start
- Run sudo journalctl -u rstudio-server -n 50 to read the service log. Missing r-base-dev or a version mismatch between the .deb and the installed R are the most common causes.
- PyArrow import error after uv pip install
- Arrow requires a recent glibc. On RHEL 8 or older Ubuntu releases, install from Conda instead: mamba install pyarrow, which bundles compatible native libraries.
- JupyterLab kernel dies immediately
- The kernel runs inside its own environment. Make sure ipykernel is installed in that environment, not just in the environment that launched JupyterLab.
Frequently asked questions
- Should I use Conda environments or uv virtualenvs for data science projects?
- Use Conda environments when you need non-Python dependencies such as CUDA, MKL, or compiled R packages. Use uv virtualenvs for pure-Python projects where install speed and lockfile reproducibility are the priority. The two tools compose well: run uv pip install inside an active Conda environment.
- Can I install JupyterLab system-wide instead of inside a Conda environment?
- It works, but it couples JupyterLab's dependencies to the OS Python, making upgrades brittle. The recommended pattern is one Conda environment that owns JupyterLab and registers kernels from all other environments.
- How do I keep the CUDA driver and toolkit from conflicting after a kernel update?
- Install the Nvidia driver via your distro's official package (apt/dnf), not from a runfile. The packaged driver hooks into DKMS and rebuilds the kernel module automatically on every kernel update.
- Is mamba going to stay maintained, or should I switch to something else?
- As of 2024, mamba is actively maintained and remains a solid choice for large conda environments. Conda itself has also used the faster libmamba-based solver by default since Conda 23.10, so the gap has narrowed.
- Why use Apache Arrow when pandas already handles DataFrames?
- Arrow provides a language-neutral, zero-copy columnar memory format. When you pass data between Python, R, or DuckDB through Arrow, no serialization happens — you share the same memory buffer. This matters when working with datasets in the tens-of-gigabytes range or building pipelines that cross tool boundaries.