Google Colab File Handling Complete Guide – Upload, Drive Integration, wget, External Storage & Gemini [2026]
Have you ever been writing code in Google Colab and gotten stuck because you didn’t know how to load files? Colab is a powerful environment that offers free T4 GPU (16GB VRAM), but its file system works fundamentally differently from your local PC.
This article thoroughly explains Colab's file operations through five methods: direct upload, Google Drive integration, command-line operations, external storage integration, and file download, each presented with concrete code examples based on the latest 2026 Colab specifications.
- Prerequisites: How Colab’s File System Works
- Google Colab Pricing Plan Comparison [2026]
- Method 1: Direct File Upload
- Method 2: Mount Google Drive (Most Practical)
- Method 3: File Operations with Linux Commands
- Method 4: External Storage and Data Source Integration
- Method 5: File Download (From Colab to External)
- 2025 New Feature: Gemini AI Integration
- Common Troubleshooting
- Alternative Free Cloud Notebooks Comparison
- Loading Methods by Data Format
- Frequently Asked Questions (FAQ)
- Conclusion: Choosing the Right Colab File Operation Method
Prerequisites: How Colab’s File System Works
The most critical point to understand first is that Colab’s runtime is a temporary virtual machine. When the session disconnects, all files uploaded to the /content directory or generated by code are completely erased.
Disk Capacity and Limits
- Disk capacity: approximately 77-100GB (system uses 31GB of the total 108GB)
- Single file upload limit: 2GB
- File count limit: mount operations may fail when exceeding approximately 10,000 files in a directory
- Session time: free tier maxes out at 12 hours (times out after 90 minutes idle)
This “temporary” nature is the biggest cause of frustration with Colab file operations. If you want to persist files, Google Drive integration is essential.
Google Colab Pricing Plan Comparison [2026]
Before diving into file operations, note that available resources vary depending on your plan:
- Free: T4 GPU, 12-15GB RAM, 15-30 GPU hours/week (limited during peak times), max 12-hour sessions
- Pro ($9.99/month): T4/P100 GPU, 32GB RAM, compute unit-based with no weekly limit
- Pro+ ($49.99/month): Priority V100/A100 GPU, 52GB RAM, max 24-hour sessions, high-memory configuration available
- Enterprise (custom pricing): Vertex AI/Compute Engine integration, dedicated runtimes, organization management
Basic file operations are the same across all plans. GPU time and RAM differences only matter for large-scale model training and data processing.
Method 1: Direct File Upload
The simplest method, convenient when you want to quickly use a small number of files.
GUI Upload
Open the file panel on the left (folder icon) and click the “Upload” button, or drag and drop files. Uploaded files are placed in the /content directory.
Code Upload
You can also use the files.upload() function from the google.colab module. When executed, a file selection dialog appears, and selected files are saved to the current directory. The return value is a dictionary with filenames as keys and byte data as values.
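For reference, here is a minimal sketch of the code-based upload:

```python
from google.colab import files

# Opens a browser file-selection dialog and blocks until files are chosen.
uploaded = files.upload()

# The return value maps each filename to its raw bytes.
for name, data in uploaded.items():
    print(f'{name}: {len(data)} bytes')
```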
Important: Directly uploaded files disappear when the session ends. Always back up important files to Google Drive. Also, files exceeding 2GB cannot be uploaded this way.
Method 2: Mount Google Drive (Most Practical)
For handling large numbers of files or carrying data across sessions, mounting Google Drive is the best approach. Running drive.mount('/content/gdrive') from the google.colab module mounts your entire Drive to /content/gdrive/MyDrive via OAuth 2.0 authentication.
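A minimal mount-and-save sketch (the colab_outputs folder name is just an example):

```python
from google.colab import drive

# Triggers OAuth 2.0 authentication, then mounts your entire Drive.
drive.mount('/content/gdrive')

# Anything written under MyDrive persists across sessions.
save_path = '/content/gdrive/MyDrive/colab_outputs/result.csv'

# Before ending the session, flush the write cache to Drive:
# drive.flush_and_unmount()
```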
Drive Mount Optimization Tips
- By default, the entire Drive is mounted, but specifying paths to specific folders improves responsiveness
- Mounting may fail if there are 10,000+ files at the Drive root. Keep your folder structure organized
- Run drive.flush_and_unmount() before ending your session to ensure write cache is properly flushed
- Google Drive free capacity is 15GB. For larger data, consider Google One (starting at $2/month)
Method 3: File Operations with Linux Commands
Since Colab’s runtime is an Ubuntu-based Linux environment, you can use shell commands by prefixing cells with “!”.
Frequently Used Commands
- !ls -la /content/ — List files under /content
- !cp source.csv /content/gdrive/MyDrive/ — Copy a file to Drive
- !wget [URL] — Download a file directly from the web
- !unzip archive.zip -d /content/data/ — Extract a ZIP file
- !du -sh /content/* — Check the size of each directory
The wget command is particularly useful — it lets you download publicly available datasets and model files directly to Colab from a URL. For large files, saving via wget from Colab to Drive is dramatically faster than going through your local PC.
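A sketch of that pattern; the URL and folder names are placeholders, and Drive is assumed to be mounted as in Method 2:

```python
# Create a destination folder on Drive, then download straight into it.
!mkdir -p /content/gdrive/MyDrive/datasets
!wget -O /content/gdrive/MyDrive/datasets/data.zip https://example.com/data.zip

# Extract into the fast (but temporary) local disk for processing.
!unzip /content/gdrive/MyDrive/datasets/data.zip -d /content/data/
```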
Method 4: External Storage and Data Source Integration
Kaggle Datasets
Using the Kaggle API, you can download public datasets from Kaggle with a single command. Simply upload the kaggle.json authentication file and run the !kaggle datasets download command.
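A typical setup sketch, assuming you have downloaded kaggle.json from your Kaggle account settings and uploaded it to the runtime; the dataset slug is a placeholder:

```python
# Move the credentials where the Kaggle API expects them.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # the API rejects world-readable credentials

# Download and unzip a public dataset (replace the slug with a real one).
!kaggle datasets download -d owner/dataset-name --unzip -p /content/data
```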
Hugging Face Models and Datasets
Using the transformers and datasets libraries, you can directly load models and datasets from Hugging Face Hub. For large models, configuring cache storage on Drive saves download time when reconnecting.
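A sketch of loading with a Drive-backed cache; the dataset/model names and cache folder are illustrative:

```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Point the cache at Drive so downloads survive session resets.
cache = '/content/gdrive/MyDrive/hf_cache'

ds = load_dataset('imdb', cache_dir=cache)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', cache_dir=cache)
model = AutoModel.from_pretrained('bert-base-uncased', cache_dir=cache)
```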
Google Cloud Storage (GCS)
For projects handling large-scale data, GCS integration is effective. Use the gcloud command-line tool to copy files from buckets. The Colab Enterprise plan also enables native integration with Vertex AI.
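A sketch using Colab's built-in authentication helper; the bucket and project names are placeholders:

```python
from google.colab import auth

# Authenticates your Google account for GCS access in this runtime.
auth.authenticate_user()

# Some buckets also require an active project:
# !gcloud config set project your-project-id

# Copy an object from a bucket into the runtime.
!gcloud storage cp gs://your-bucket/path/to/file.csv /content/
```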
Method 5: File Download (From Colab to External)
To save files generated in Colab to your local PC, use files.download('filename') from the google.colab module. This automatically initiates a browser download.
For batch downloading multiple files, it's more efficient to compress them into a ZIP first. Use !zip -r output.zip /content/results/ to compress an entire folder, then download with files.download('output.zip').
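Put together:

```python
from google.colab import files

# Compress an entire results folder, then trigger a browser download.
!zip -r output.zip /content/results/
files.download('output.zip')
```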
2025 New Feature: Gemini AI Integration
In 2025, Colab underwent a major overhaul into an “AI-First” environment with the integration of Gemini 2.5 Flash. Key new features include:
- AI Pair Programming: Automatic code generation, auto error correction, and suggestions displayed in diff format
- Data Science Agent (DSA): Fully integrated in March 2025. Autonomously performs data exploration, analysis, and pattern discovery from natural language instructions alone
- AI Prompt Cells: Enables no-code/low-code data transformation, analysis, and visualization
- Gemini Model Access: All users can access Gemini/Gemma models for free via the google.colab.ai library
For file operations too, simply instructing the AI in natural language like “load this CSV file and visualize it” will auto-generate the appropriate code.
Common Troubleshooting
Drive Mount Fails
The most common cause is too many files in the Drive root directory. Having 10,000+ files can cause timeouts. The solution is to organize your Drive with proper folder structure. Also, browser popup blockers may be blocking the OAuth authentication window.
Files Disappeared After Session Disconnect
This is Colab's most frustrating behavior. There are three countermeasures: First, include code cells that periodically save important intermediate files to Google Drive. Second, build automatic checkpoint saving to Drive into your training code, as in the sketch below. Third, the Pro+ plan offers up to 24-hour sessions, which helps for long training runs.
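A minimal PyTorch-flavored sketch of the second countermeasure. It assumes Drive is already mounted and that model, optimizer, num_epochs, and train_one_epoch() already exist in your training script; the checkpoints folder name is illustrative.

```python
import os
import torch

ckpt_dir = '/content/gdrive/MyDrive/checkpoints'  # illustrative Drive folder
os.makedirs(ckpt_dir, exist_ok=True)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical per-epoch training step
    # Saving to Drive every epoch means a disconnect costs at most one epoch.
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, f'{ckpt_dir}/epoch_{epoch}.pt')
```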
Encoding Errors When Reading Files
This frequently occurs with CSV files containing Japanese text. Explicitly specify encoding='utf-8' or encoding='shift_jis' in pandas' read_csv(). If those don't work, try encoding='cp932'.
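A simple fallback pattern:

```python
import pandas as pd

# Try UTF-8 first, then fall back to cp932 (a Windows superset of Shift_JIS).
try:
    df = pd.read_csv('data.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('data.csv', encoding='cp932')
```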
Alternative Free Cloud Notebooks Comparison
- Kaggle Notebooks: 30+ GPU hours/week. T4/P100 GPU support. Background execution continues even with tab closed. Direct dataset access is its strength
- AWS SageMaker Studio Lab: T4 GPU, 16GB RAM. 10 GPU hours/month. No AWS account required
- Lightning AI: 4 hours GPU + 8 hours CPU daily. No credit card required. Optimized for PyTorch Lightning
If you’re dissatisfied with Colab’s free tier, using Kaggle Notebooks as a secondary option offers the best value.
Loading Methods by Data Format
CSV / TSV
pandas' read_csv() is the standard. For Japanese files, specify encoding='utf-8' or 'cp932'. For TSV, add sep='\t'. For large CSV files, use the chunksize parameter for chunked reading to prevent memory shortages.
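A chunked-reading sketch (the filename and chunk size are illustrative):

```python
import pandas as pd

# Process a large CSV in 100,000-row chunks to keep memory use bounded.
total_rows = 0
for chunk in pd.read_csv('large.csv', chunksize=100_000):
    total_rows += len(chunk)  # replace with your per-chunk processing
print(total_rows)
```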
Image Files (PNG / JPG)
Load with PIL (Pillow) or OpenCV. For machine learning image datasets, torchvision.datasets.ImageFolder or tf.keras.utils.image_dataset_from_directory are convenient as they automatically recognize labels from folder structure.
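A minimal ImageFolder sketch, assuming /content/data/ holds one subfolder per class:

```python
import torchvision
from torchvision import transforms

# Labels are inferred from the folder names, e.g. data/cat/, data/dog/.
dataset = torchvision.datasets.ImageFolder(
    root='/content/data',
    transform=transforms.ToTensor(),
)
print(dataset.classes)  # class names discovered from the folder structure
```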
JSON / JSONL
Load directly with pandas’ read_json(). For JSONL files (one JSON per line), specify lines=True. JSONL is the mainstream format for LLM fine-tuning datasets, so this loading method is worth remembering.
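For example (the filename is illustrative):

```python
import pandas as pd

# lines=True tells pandas each line is a separate JSON object (JSONL).
df = pd.read_json('train.jsonl', lines=True)
```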
Excel (xlsx)
Use pandas’ read_excel(). Requires the openpyxl library — run !pip install openpyxl first if not installed. For workbooks with multiple sheets, specify sheets using the sheet_name parameter.
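For example (the filename and sheet name are placeholders):

```python
import pandas as pd

# Read one sheet by name; sheet_name=None returns a dict of all sheets.
df = pd.read_excel('book.xlsx', sheet_name='Sheet1')
all_sheets = pd.read_excel('book.xlsx', sheet_name=None)
```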
Frequently Asked Questions (FAQ)
How can I prevent frequent runtime disconnections?
There’s no complete prevention method, but three measures help: First, keep the browser tab active without closing it. Second, incorporate checkpoint saving to Google Drive during training. Third, upgrading to Pro+ gives you up to 24-hour sessions with priority connections.
Can I choose the Python version in Colab?
Since May 2025, Colab's default runtime uses Python 3.12. You can switch versions by building a virtual environment with conda or pyenv, but those settings are lost when the session ends, so it's practical to consolidate the setup commands into a cell you rerun at the start of each session.
How do I handle files larger than 2GB?
Direct upload has a 2GB limit, but via Google Drive, the limit depends on your Drive capacity. Using wget to download from the web, file size is limited only by Colab’s disk capacity (approximately 77-100GB).
How can I keep files after session disconnects?
Mounting and working with Google Drive is the most reliable approach. Mount Drive at the beginning of your code and set all output paths to Drive locations. For model training, always include code to save checkpoints to Drive at each epoch.
Are there differences in file operations between free and Pro?
File operations themselves are identical. The differences are in session time (free 12 hours vs Pro+ 24 hours) and RAM (free 15GB vs Pro+ 52GB). Pro or higher may be needed when loading large dataframes into memory.
Can I clone a GitHub repository directly in Colab?
Yes. Simply run !git clone [repository URL] in a cell. For private repositories, authenticate using a Personal Access Token. Cloned files are stored in /content and disappear when the session ends, so push any changes beforehand.
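One common pattern for private repositories (the placeholders below are not real credentials):

```python
# Hypothetical example: clone a private repo with a Personal Access Token.
# Replace USERNAME, TOKEN, and the repository path with your own values.
!git clone https://USERNAME:TOKEN@github.com/USERNAME/private-repo.git
```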
Conclusion: Choosing the Right Colab File Operation Method
The key to Google Colab file operations is choosing the right method for your use case. Remember: direct upload for temporary use of small files, Google Drive mount for ongoing work, wget or GCS integration for large data, and files.download for retrieving results.
With Gemini integration since 2025, you can even have AI generate the file operation code itself. Just say “load this CSV and make a graph” and the code appears. However, the fundamental behavior of files being lost on session disconnect hasn’t changed, so regular saving to Google Drive remains an essential part of your workflow.

