File
The File class represents an individual file within a Dataverse dataset. Files are the actual data content stored in datasets—the CSV files, images, code, documents, and other resources that researchers want to share and preserve. Each file has associated metadata (such as name, size, content type, description, and categories) and provides methods for accessing file contents and managing file properties.
In Dataverse, files are organized hierarchically within datasets using paths, similar to a filesystem. Files can be organized into virtual directories (like “data/”, “code/”, “docs/”) to keep related files together. Each file has a unique integer identifier assigned by the Dataverse server, which is used for API operations, and a path that describes its location within the dataset.
This class provides a convenient interface for working with files, including reading file contents, downloading files to your local filesystem, updating file metadata, and working with tabular data files using pandas. It abstracts away the complexity of working with Dataverse’s file API while maintaining full access to file functionality.
Overview
The File class encapsulates all operations related to a file within a dataset, providing a unified interface for working with files throughout their lifecycle—from initial access through reading, analysis, and metadata management.
The class handles several key responsibilities:
- Metadata access: Retrieve comprehensive metadata about the file, including its name, size, content type, description, categories, and other properties. This metadata helps you understand what the file contains and how it should be used.
- File reading: Read file contents as text or binary data, or as structured tabular data using pandas DataFrames. The class supports both loading entire files into memory and streaming large files in chunks.
- File downloading: Download files from Dataverse to your local filesystem for offline access or backup. Downloads are streamed, so even large files are handled efficiently.
- Metadata management: Update file metadata such as the description, categories, filename, and directory path. This lets you organize and describe files after they have been uploaded.
- Tabular data operations: Detect tabular files and load them as pandas DataFrames, with support for custom parsing options and streaming for large files.
Attributes
A File instance contains two key attributes that identify it and connect it to its parent dataset:

- identifier (int): The unique integer identifier assigned to the file by the Dataverse server. It is used for all API operations involving the file and remains constant throughout the file's lifetime. It is a frozen attribute, meaning it cannot be changed after the File object is created.
- dataset (Dataset): A reference to the parent Dataset instance that contains this file. This provides access to dataset-level operations and ensures the file knows its context within the dataset structure.
The identifier is the primary way to reference a file in API calls, while the dataset reference enables file operations that require dataset context, such as reading file contents or accessing dataset-level properties.
Accessing Files
Files are typically accessed through a dataset's files property, which provides a view of all files in the dataset:

```python
from pyDataverse import Dataverse

dv = Dataverse("https://demo.dataverse.org")
dataset = dv.fetch_dataset("doi:10.5072/FK2/ABC123")

# Access a file by path
file = dataset.files["data/results.csv"]

# Iterate over all files
for file in dataset.files:
    print(f"{file.path}: {file.metadata.data_file.filesize} bytes")

# Access file properties
print(file.identifier)     # Integer ID
print(file.path)           # Path within dataset
print(file.dataset.title)  # Parent dataset title
```

The files property returns a FilesView object that supports both iteration and dictionary-like access by file path. When you access a file by path, the view automatically creates a File object with the correct identifier and dataset reference.
Files can also be accessed by iterating over the files view, which loads file information on demand. This lazy-loading approach is efficient for datasets with many files, as you only create File objects for files you actually need.
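The lazy, dict-like behavior of such a view can be illustrated with a toy stand-in that only constructs an object when a path is actually requested. This is a purely hypothetical sketch, not the real FilesView class:

```python
class LazyFilesView:
    """Toy sketch of a lazily-constructing, dict-like files view."""

    def __init__(self, entries, factory):
        self._entries = entries  # {path: identifier}, known up front
        self._factory = factory  # builds a file object only on demand

    def __getitem__(self, path):
        # Object is created only when this path is actually requested
        return self._factory(self._entries[path], path)

    def __iter__(self):
        for path, identifier in self._entries.items():
            yield self._factory(identifier, path)


view = LazyFilesView({"data/results.csv": 42},
                     lambda ident, path: (ident, path))
result = view["data/results.csv"]  # (42, "data/results.csv")
```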
File Metadata
Each file has comprehensive metadata that describes its properties and characteristics. You can access this metadata through the metadata property:

```python
# Get file metadata
metadata = file.metadata

# Access metadata fields
print(metadata.label)            # Filename
print(metadata.directory_label)  # Directory path
print(metadata.description)      # File description
print(metadata.categories)       # Category tags

# Access data file information
if metadata.data_file:
    print(metadata.data_file.filesize)      # File size in bytes
    print(metadata.data_file.content_type)  # MIME type
    print(metadata.data_file.md5)           # MD5 checksum
    print(metadata.data_file.id)            # File identifier
```

The metadata property fetches fresh metadata from the Dataverse server each time it’s accessed, ensuring you always have up-to-date information. This is useful when metadata might have been changed by other users or through the web interface.
File metadata includes:
- Label: The filename
- Directory label: The virtual directory path within the dataset (e.g., “data/”, “code/”)
- Description: A human-readable description of the file’s contents or purpose
- Categories: A list of category strings (tags) that help organize and classify the file
- Data file information: Size, content type (MIME type), checksums, and other technical details
The metadata is returned as a FileInfo object, which provides structured access to all file properties. This object validates the data structure and ensures you’re working with consistent, well-formed metadata.
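The exact shape of FileInfo is defined by the library, but conceptually it behaves like a small validated record. A hypothetical sketch of such a structure, with field names borrowed from the properties listed above:

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical sketch of a FileInfo-like record; the real class is
# provided by the library and may validate more strictly.
@dataclass(frozen=True)
class FileInfoSketch:
    label: str                             # filename
    directory_label: Optional[str] = None  # virtual directory, e.g. "data"
    description: Optional[str] = None
    categories: List[str] = field(default_factory=list)


info = FileInfoSketch(label="results.csv",
                      directory_label="data",
                      categories=["Data", "Results"])
```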
File Path
The path property provides the file's full path within the dataset, combining the directory label and filename:

```python
# Get the file path
file_path = file.path
# Output: "data/results.csv" or "readme.txt" (if no directory)

# The path combines directory_label and label
# For a file in "data/" directory named "results.csv":
# path = "data/results.csv"
```

The path is constructed by joining the directory label and filename with a forward slash. If the file is in the root of the dataset (no directory), the path is just the filename. This path format is consistent with filesystem paths and makes it easy to understand file organization within the dataset.
The path property is computed from the metadata each time it’s accessed, so it always reflects the current state of the file. If you update the file’s directory or filename using update_metadata(), the path property will reflect those changes on the next access.
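The joining rule described above can be sketched as a small helper. This is a hypothetical illustration of the logic, not the library's actual implementation:

```python
def build_path(directory_label, label):
    """Join a virtual directory and a filename with a forward slash."""
    if directory_label:
        # Tolerate a trailing slash on the directory label
        return f"{directory_label.rstrip('/')}/{label}"
    # No directory: the path is just the filename
    return label


build_path("data", "results.csv")  # "data/results.csv"
build_path(None, "readme.txt")     # "readme.txt"
```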
Reading Files
The File class provides several methods for reading file contents, supporting different use cases and file types.
Reading Text and Binary Files
The open() method provides a file-like interface for reading file contents:

```python
# Read a text file
with file.open(mode="r") as f:
    content = f.read()
    print(content)

# Read a binary file
with file.open(mode="rb") as f:
    binary_data = f.read()
    # Process binary data...

# Read line by line
with file.open(mode="r") as f:
    for line in f:
        print(line.strip())
```

The open() method returns a DataverseFileReader object that supports standard file operations like read(), readline(), and iteration. It can be used with Python’s context manager (with statement) for proper resource management, ensuring file handles are closed after use.
For text mode (mode="r"), the file is automatically decoded using UTF-8 encoding. For binary mode (mode="rb"), you receive raw bytes. Choose the appropriate mode based on what you plan to do with the file—text mode for reading text files, binary mode for images, executables, or other binary data.
The reader streams data from the Dataverse server, so it’s efficient for large files. Data is fetched in chunks as you read, rather than loading the entire file into memory at once. This makes it suitable for processing files of any size.
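Chunked streaming of this kind follows a standard pattern: request a fixed number of bytes at a time and stop at end of stream. A minimal sketch over any file-like object:

```python
import io


def iter_chunks(stream, chunk_size=8192):
    """Yield successive byte chunks from a file-like object until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read signals end of stream
            break
        yield chunk


# An in-memory stream stands in for a network response here
data = io.BytesIO(b"x" * 20000)
sizes = [len(c) for c in iter_chunks(data)]  # [8192, 8192, 3616]
```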
Reading Tabular Files
For CSV and other tabular files, the File class provides convenient methods that return pandas DataFrames:

```python
# Check if file is tabular
if file.is_tabular:
    # Load entire file as DataFrame
    df = file.open_tabular()
    print(df.head())

    # Load with custom options
    df = file.open_tabular(
        usecols=["col1", "col2"],          # Select specific columns
        dtype={"col1": str, "col2": int},  # Specify data types
        na_values=["N/A", "null"],         # Values to treat as NaN
    )

    # Load without header row
    df = file.open_tabular(no_header=True)
```

The open_tabular() method automatically detects tabular files based on their content type (MIME type). Common tabular formats include CSV, TSV, and SAV files. If you call this method on a non-tabular file, a ValueError is raised.
The method accepts the same keyword arguments as pandas’ read_csv() function, giving you fine-grained control over how the file is parsed. You can specify delimiters, data types, which columns to read, how to handle missing values, and many other options. This flexibility allows you to adapt the reading process to your specific data format and analysis needs.
The is_tabular property provides a quick way to check if a file is tabular before attempting to load it. This is useful when iterating over files and only processing tabular ones, or when you want to provide different handling for different file types.
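A common pattern is iterating over files and only processing the tabular ones. Content-type-based detection can be sketched like this; the specific set of MIME types is an assumption for illustration, not the library's actual list:

```python
# Hypothetical MIME types treated as tabular; the library's real
# detection logic may cover more formats.
TABULAR_TYPES = {
    "text/csv",
    "text/tab-separated-values",
    "application/x-spss-sav",
}


def looks_tabular(content_type):
    """Return True if the MIME type is in the tabular set."""
    return content_type in TABULAR_TYPES


looks_tabular("text/csv")   # True
looks_tabular("image/png")  # False
```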
Streaming Large Tabular Files
For very large tabular files that don't fit in memory, use stream_tabular() to process them in chunks:

```python
# Stream large file in chunks
for chunk in file.stream_tabular(chunk_size=10000):
    # Process each chunk (each is a DataFrame)
    process_data(chunk)
    print(f"Processed {len(chunk)} rows")

# Stream with custom options
for chunk in file.stream_tabular(
    chunk_size=5000,
    usecols=[0, 1, 2],    # Only read first 3 columns
    dtype={"col1": str},
):
    analyze_chunk(chunk)
```

The stream_tabular() method yields the file in chunks, where each chunk is a pandas DataFrame containing a subset of the rows. This allows you to process large files incrementally without loading everything into memory at once. The chunk_size parameter controls how many rows are included in each chunk.
This approach is particularly useful for datasets with millions of rows or when working on systems with limited memory. You can process each chunk independently, aggregate results, or write processed chunks to output files. The method accepts the same keyword arguments as open_tabular(), so you can apply the same parsing options to each chunk.
Like open_tabular(), this method only works with tabular files. If you try to stream a non-tabular file, a ValueError is raised.
Downloading Files
You can download files from Dataverse to your local filesystem using the download() method:

```python
from pathlib import Path

# Download to a specific path
file.download("local_copy.csv")

# Download to a Path object
file.download(Path("downloads") / "results.csv")

# Download with custom chunk size (for very large files)
file.download("large_file.bin", chunk_size=16384)
```

The download() method streams the file from the Dataverse server and writes it to the specified local path. It handles the entire download process, including creating the file, writing data in chunks, and closing the file when complete. The method prints a confirmation message showing where the file was saved.
The chunk_size parameter controls how many bytes are read from the server per chunk. The default (8192 bytes) works well for most files, but you can increase it for very large files to improve download speed, or decrease it if you need to monitor progress more frequently.
The method automatically creates parent directories if they don’t exist, so you can specify paths like “downloads/data/results.csv” even if the “downloads/data/” directory doesn’t exist yet. If the target file already exists, it will be overwritten.
Downloads are performed efficiently using streaming, so even very large files can be downloaded without consuming excessive memory. The file is written to disk as it’s downloaded, rather than being loaded into memory first.
Updating File Metadata
You can update file metadata using the update_metadata() method:

```python
# Update file description
file.update_metadata(description="Updated description of the file contents")

# Update categories (tags)
file.update_metadata(categories=["Data", "Results", "2024"])

# Update filename
file.update_metadata(filename="new_filename.csv")

# Update directory path
file.update_metadata(directory_label="processed_data")

# Update multiple properties at once
file.update_metadata(
    description="Experimental results from trial 3",
    categories=["Experimental", "Results"],
    filename="trial3_results.csv",
)
```

The update_metadata() method allows you to update any combination of metadata fields. You only need to provide the fields you want to change—omitted fields remain unchanged. This makes it easy to make targeted updates without needing to specify all fields.
When you update the filename or directory label, the file’s path changes accordingly. The path property will reflect the new path on the next access. This is useful for reorganizing files within a dataset or correcting file names.
The method sends updates to the Dataverse server immediately, so changes are visible right away. If the update fails (for example, due to permission issues or validation errors), an exception is raised. After a successful update, the metadata is refreshed, so subsequent accesses to the metadata property will reflect the changes.
Categories are particularly useful for organizing files. You can use them to tag files by type (e.g., “Data”, “Code”, “Documentation”), by purpose (e.g., “Raw”, “Processed”, “Analysis”), or by any other classification scheme that helps organize your files. Categories can be used for filtering and searching files within a dataset.
Complete Example
The following example demonstrates a complete workflow for working with files:

```python
from pyDataverse import Dataverse
from pathlib import Path

# Initialize connection
dv = Dataverse(
    base_url="https://demo.dataverse.org",
    api_token="your-token",
)

# Fetch a dataset
dataset = dv.fetch_dataset("doi:10.5072/FK2/ABC123")

# Browse files in the dataset
print("Files in dataset:")
for file in dataset.files:
    print(f"  - {file.path} ({file.metadata.data_file.filesize} bytes)")

# Access a specific file
csv_file = dataset.files["data/results.csv"]

# View file metadata
print(f"\nFile: {csv_file.path}")
print(f"Size: {csv_file.metadata.data_file.filesize} bytes")
print(f"Type: {csv_file.metadata.data_file.content_type}")
print(f"Description: {csv_file.metadata.description}")

# Check if file is tabular
if csv_file.is_tabular:
    # Load as DataFrame
    df = csv_file.open_tabular()
    print(f"\nLoaded {len(df)} rows, {len(df.columns)} columns")
    print(df.head())

    # Process large file in chunks if needed
    if len(df) > 100000:
        print("\nFile is large, processing in chunks...")
        for chunk in csv_file.stream_tabular(chunk_size=10000):
            process_chunk(chunk)

# Read a text file
readme_file = dataset.files["readme.txt"]
with readme_file.open(mode="r") as f:
    content = f.read()
    print(f"\nReadme content:\n{content}")

# Download files locally
csv_file.download("local_results.csv")
readme_file.download("local_readme.txt")

# Update file metadata
csv_file.update_metadata(
    description="Updated: Results from all experimental trials",
    categories=["Data", "Results", "Experimental"],
)

# Verify metadata update
print(f"\nUpdated description: {csv_file.metadata.description}")
print(f"Updated categories: {csv_file.metadata.categories}")
```

This example demonstrates the typical workflow: accessing files through a dataset, viewing metadata, reading file contents (both tabular and text), downloading files locally, and updating file metadata. Files provide a convenient interface for working with individual data files within datasets, making it easy to access, analyze, and manage research data.
Related Classes
- Dataset - Represents a Dataverse dataset that contains files
- Dataverse - Factory class for creating and managing datasets and collections
- Collection - Represents a Dataverse collection that can contain datasets with files
See Also
- Dataset Documentation - Learn more about working with datasets and their files
- Dataverse Documentation - Learn more about creating and managing datasets
- Collection Documentation - Learn more about collections and organizing datasets