Dataverse
The Dataverse class is the main entry point for working with a Dataverse installation. It establishes a connection to a Dataverse server and provides methods to create new datasets, retrieve existing ones, browse collections, and access all available features.
When you instantiate a Dataverse object, it automatically connects to the server and retrieves information about the available metadata blocks, eliminating the need to configure complex metadata structures by hand. The class functions as a factory for creating Dataset instances with properly configured metadata blocks, and it provides convenient access to collections, metrics, and the underlying API clients.
Overview
The Dataverse class connects to a Dataverse installation and supports several important tasks.
- Creating new datasets: When creating a dataset, the class automatically fetches all available metadata blocks from the server and configures them accordingly. Metadata blocks are structured collections of related fields. For example, a "citation" block contains fields for title, author, and description, while a "geospatial" block contains fields for coordinates and location data.
- Fetching existing datasets and collections: You can retrieve datasets and collections that already exist on the server by their identifier, such as a DOI or database ID.
- Accessing metrics and statistics: The class provides easy access to statistics about the Dataverse installation, such as how many datasets it contains and how collections are organized by subject area.
- Managing collections: Collections (also called "sub-dataverses") are organizational units that group related datasets. The class enables you to create new collections and browse existing ones.
- Providing access to underlying API clients: For advanced users who need more control, the class exposes the underlying API clients that communicate directly with the Dataverse server.
Initialization
To begin working with a Dataverse installation, create a Dataverse instance by specifying the server URL and optionally providing authentication credentials for operations that require them.
Create a Dataverse instance by providing the base URL of your Dataverse installation:
```python
from pyDataverse import Dataverse

# Connect to a Dataverse instance (read-only access)
# This works for browsing public datasets and reading metadata
dv = Dataverse(base_url="https://demo.dataverse.org")

# With API token for authenticated operations
# You need this if you want to create, modify, or delete datasets
dv = Dataverse(
    base_url="https://dataverse.example.edu",
    api_token="your-api-token-here",
)
```

Understanding the Parameters
When creating a Dataverse instance, you need to provide:
- `base_url` (str, required): The base URL of the Dataverse installation to connect to, for example "https://demo.dataverse.org" or "https://dataverse.harvard.edu". All API requests are sent to this server.
- `api_token` (str, optional): An authentication token that authorizes operations requiring permissions. It is required for creating, modifying, or deleting datasets or collections, but not for browsing or reading public data. You can generate API tokens in your Dataverse account settings.
- `verbose` (int, optional): Controls the verbosity of logging output for debugging purposes. The default is 1, which provides moderate logging. Increase the value for more detailed output or decrease it for quieter operation.
Creating Datasets
One of the most common tasks you'll perform with the Dataverse class is creating new datasets. A dataset in Dataverse is a container that holds your research data files along with metadata (information about the data, such as who created it, when, and what it contains).
The Dataverse class simplifies dataset creation by automatically configuring all available metadata blocks. Metadata blocks are structured collections of related fields that organize dataset metadata. Each Dataverse installation may have different metadata blocks enabled. Common examples include “citation” for basic information such as title and authors, “geospatial” for location data, and “social science” for survey-specific fields. The class automatically discovers available blocks on the server and configures them accordingly.
Here’s how to create a new dataset:
```python
from pyDataverse import Dataverse

# First, connect to your Dataverse installation
dv = Dataverse("https://demo.dataverse.org")

# Create a new dataset with basic information
dataset = dv.create_dataset(
    title="My Research Dataset",
    description=(
        "A comprehensive dataset containing experimental results "
        "from our study on machine learning algorithms"
    ),
    authors=[
        {
            "name": "Jane Smith",
            "affiliation": "University of Science",
            "identifier_scheme": "ORCID",        # Optional: identifies the author using ORCID
            "identifier": "0000-0000-0000-0000", # Optional: the actual ORCID number
        }
    ],
    contacts=[
        {
            "name": "Jane Smith",
            "email": "jane.smith@university.edu",
            "affiliation": "University of Science",  # Optional: where they work
        }
    ],
    # Categories for the dataset
    subjects=["Computer and Information Science", "Engineering"],
)

# The dataset is now ready to use with all metadata blocks configured
# You can access the citation metadata block like this:
print(dataset.metadata_blocks["citation"].title)
# Output: 'My Research Dataset'
```

Authors and contacts are provided as Python dictionaries. Each author dictionary can include the author's name and affiliation, and optionally an identifier scheme (such as ORCID) together with the corresponding identifier value.
The subjects parameter accepts a list of predefined subject categories that organize datasets by research area. Common subjects include “Agricultural Sciences”, “Physics”, and “Social Sciences”. You can specify multiple subjects for datasets that span multiple research areas.
After creating the dataset, it is ready to use locally, but it has not yet been uploaded to the server. You can add files to it, modify its metadata, and upload it when you are ready. See the Dataset documentation for more details on working with datasets after they are created.
Accessing Collections and Datasets
The Dataverse class provides convenient properties that let you access collections and datasets directly from the root level of the installation. These properties return special view objects that support both iteration (looping through items) and dictionary-like access (looking up items by identifier).
Collections
The `collections` property provides access to all collections at the root level of the Dataverse installation. It supports two access patterns.
```python
# Iterative access: iterate over all collections
for collection in dv.collections:
    print(collection.identifier)

# Direct access: retrieve a specific collection by identifier
harvard_collection = dv.collections["harvard"]  # Access by alias
collection = dv.collections[123]                # Access by database ID
```

This property is useful for browsing available collections and for accessing a specific collection when its identifier is known.
Datasets
Similarly, the `datasets` property provides access to all datasets at the root level.
```python
# Iterative access: iterate over all datasets
for dataset in dv.datasets:
    print(dataset.title)

# Direct access: retrieve a specific dataset by identifier
dataset = dv.datasets["doi:10.5072/FK2/ABC123"]  # Access by DOI
dataset = dv.datasets[12345]                     # Access by database ID
```

These properties access datasets and collections at the root level only. To access datasets within a specific collection, first fetch that collection and then use its `datasets` property.
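The dual access pattern these view objects expose (iteration plus key lookup) is Python's standard container protocol. As an illustration only, not the library's actual implementation, a minimal sketch of a view that behaves this way might look like:

```python
# Illustrative sketch: a view object supporting both iteration and
# dictionary-like lookup, analogous to dv.collections / dv.datasets.
# All names here are hypothetical.

class RegistryView:
    """Iterates over items and supports lookup by identifier."""

    def __init__(self, items):
        # items: mapping of identifier (alias, DOI, or database ID) -> object
        self._items = dict(items)

    def __iter__(self):
        # Iteration yields the objects themselves, not the keys
        return iter(self._items.values())

    def __getitem__(self, key):
        # Direct lookup by identifier
        return self._items[key]


view = RegistryView({"harvard": "Harvard Dataverse", 123: "Some Collection"})

for item in view:          # iterative access
    print(item)
print(view["harvard"])     # direct access by alias
print(view[123])           # direct access by database ID
```

Any object implementing `__iter__` and `__getitem__` like this supports both patterns shown above.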
Creating Collections
Collections (sub-dataverses) are organizational units that structure a Dataverse installation by grouping related datasets. Common use cases include collections for research labs, departments, or projects.
Creating a collection requires several pieces of information.
```python
collection = dv.create_collection(
    alias="research-lab",        # Short, unique identifier (no spaces allowed)
    name="Research Laboratory",  # Human-readable display name
    # What this collection is for
    description="A collection for research lab datasets and publications",
    affiliation="Department of Science",  # Which organization this belongs to
    dataverse_type="LABORATORY",          # What kind of collection this is
    # Who to contact about this collection
    dataverse_contacts=["lab@university.edu", "admin@university.edu"],
    # Where to create it (usually "root" for top-level, or another collection's alias)
    parent="root",
)
```

Parameter descriptions:
- `alias`: A short, unique identifier for the collection. It must not contain spaces and should be memorable. This identifier is used to reference the collection in code and URLs.
- `name`: The display name visible to users. It can be longer and more descriptive than the alias.
- `description`: A text description that explains the purpose of the collection and helps users understand what types of datasets it contains.
- `affiliation`: The organization or department that owns the collection, typically a university, department, or research institution.
- `dataverse_type`: Categorizes the type of collection. Common types include:
  - `"DEPARTMENT"`: for academic departments
  - `"LABORATORY"`: for research labs
  - `"RESEARCH_PROJECTS"`: for specific research projects
  - `"JOURNALS"`: for journal-related collections
  - `"ORGANIZATIONS_INSTITUTIONS"`: for larger organizations
  - `"RESEARCHERS"`: for individual researchers
  - `"RESEARCH_GROUP"`: for research groups
  - `"TEACHING_COURSES"`: for course-related content
  - `"UNCATEGORIZED"`: for collections that do not fit other categories
- `dataverse_contacts`: A list of email addresses for collection administrators or maintainers who can be contacted regarding the collection.
- `parent`: The parent location for the collection. Use `"root"` to create it at the top level, or provide another collection's alias to create it as a sub-collection.
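The constraints above (no spaces in the alias, a fixed set of collection types) lend themselves to a quick local sanity check before calling the server. A sketch of such a check; this helper is not part of the pyDataverse API, just an illustration of the rules:

```python
# Illustrative pre-flight validation for collection arguments.
# The allowed-type list mirrors the common dataverse_type values above;
# your installation may differ.

ALLOWED_TYPES = {
    "DEPARTMENT", "LABORATORY", "RESEARCH_PROJECTS", "JOURNALS",
    "ORGANIZATIONS_INSTITUTIONS", "RESEARCHERS", "RESEARCH_GROUP",
    "TEACHING_COURSES", "UNCATEGORIZED",
}

def check_collection_args(alias: str, dataverse_type: str) -> list:
    """Return a list of problems with the proposed collection arguments."""
    problems = []
    if " " in alias:
        problems.append("alias must not contain spaces")
    if dataverse_type not in ALLOWED_TYPES:
        problems.append(f"unknown dataverse_type: {dataverse_type!r}")
    return problems

print(check_collection_args("research-lab", "LABORATORY"))  # []
print(check_collection_args("research lab", "LAB"))         # two problems
```

Running checks like this locally gives clearer error messages than waiting for the server to reject the request.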
After creating a collection, you can start adding datasets or create sub-collections within it.
Metrics
The Dataverse class provides easy access to metrics and statistics about the Dataverse installation. Metrics are useful for understanding how much content is in the installation, how it is organized, and how it is used.
You can access metrics through the metrics property:
```python
# Get total counts for different content types
total_datasets = dv.metrics.total("datasets")       # How many datasets exist
total_collections = dv.metrics.total("dataverses")  # How many collections exist
total_files = dv.metrics.total("files")             # How many files exist
total_downloads = dv.metrics.total("downloads")     # How many times files have been downloaded

# Get collections grouped by subject area
# This returns a pandas DataFrame showing how many collections exist
# in each subject category
collections_by_subject = dv.metrics.collections_by_subject

# Get a summary DataFrame with all key metrics
# This is convenient when you want an overview of everything
summary = dv.metrics.df
```

Metrics are particularly useful for:
- Administrators: understanding the size and scope of their Dataverse installation
- Researchers: finding out what kind of content is available
- Analysts: tracking usage and growth over time
The metrics are fetched from the Dataverse server, so they reflect the current state of the installation.
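For instance, the raw totals can be combined into derived figures for a usage report. A sketch in plain Python, using made-up numbers in place of real `dv.metrics.total(...)` results:

```python
# Hypothetical numbers standing in for dv.metrics.total(...) results
totals = {
    "datasets": 1200,
    "dataverses": 85,
    "files": 9600,
    "downloads": 48000,
}

# Derived figures an administrator might report
files_per_dataset = totals["files"] / totals["datasets"]
downloads_per_file = totals["downloads"] / totals["files"]

print(f"files per dataset:  {files_per_dataset:.1f}")   # 8.0
print(f"downloads per file: {downloads_per_file:.1f}")  # 5.0
```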
Available Licenses
Dataverse installations can be configured with different license options that dataset creators can choose from. Common licenses include Creative Commons licenses, custom institutional licenses, and installation-specific licenses. The `licenses` property gives you access to the list of licenses available on your Dataverse installation.
```python
# Get all available licenses
licenses = dv.licenses

# Iterate through them to see what's available
for license in licenses:
    print(f"{license.name}: {license.uri}")
```

This is useful when creating datasets and you need to know which license options are available, and when writing code that must handle different license types.
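When picking a license programmatically, a small lookup helper keeps the selection logic out of your dataset code. A sketch using a stand-in license type; the real objects come from `dv.licenses` and, per the example above, expose at least `name` and `uri`:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class License:
    # Stand-in for the license objects returned by dv.licenses
    name: str
    uri: str

def find_license(licenses, name: str) -> Optional[License]:
    """Return the first license whose name matches, or None."""
    for lic in licenses:
        if lic.name == name:
            return lic
    return None

# Example data; real installations advertise their own license list
available = [
    License("CC0 1.0", "http://creativecommons.org/publicdomain/zero/1.0"),
    License("CC BY 4.0", "http://creativecommons.org/licenses/by/4.0"),
]

chosen = find_license(available, "CC BY 4.0")
print(chosen.uri if chosen else "not available")
```

Returning `None` rather than raising lets calling code fall back to a default license when the preferred one is not offered.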
Metadata Blocks
One of the key features of the Dataverse class is its automatic handling of metadata blocks. Metadata blocks are structured collections of related fields. For example, a "citation" block contains fields for title, author, and publication date. A "geospatial" block contains fields for coordinates and map projections.
The Dataverse class automatically fetches information about all available metadata blocks from your Dataverse installation and creates Pydantic models for them. Pydantic is a Python library that provides data validation. These models ensure that the metadata you provide is in the correct format.
Here’s how you can work with metadata blocks:
```python
# Get Pydantic model classes for all metadata blocks
# This returns a dictionary mapping block names to their model classes
blocks = dv.to_pydantic()

# Access a specific block model
# For example, get the model for the citation block
citation_block = blocks["citation"]

# Now you can create instances of this model or inspect its structure
```

This automatic discovery eliminates the need to manually configure metadata blocks or determine which fields are available. When you create a dataset, all available metadata blocks are automatically included and configured for use.
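To see what having model classes buys you (field introspection and structured construction), here is a rough standard-library analogy using dataclasses. The real block models are Pydantic classes with validation, and the field names below are a hypothetical, simplified stand-in for a citation-like block:

```python
from dataclasses import dataclass, fields

@dataclass
class CitationSketch:
    # Hypothetical, simplified stand-in for a generated "citation" model
    title: str
    author: str
    description: str

# With a model class in hand, you can inspect which fields a block expects...
field_names = [f.name for f in fields(CitationSketch)]
print(field_names)  # ['title', 'author', 'description']

# ...and construct instances with a known structure
record = CitationSketch(
    title="My Research Dataset",
    author="Jane Smith",
    description="Experimental results",
)
print(record.title)  # My Research Dataset
```

Pydantic models work the same way but additionally validate types and required fields when an instance is created.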
Model Conversion
The Dataverse class can convert dataset structures to JSON schema format. JSON schema is a standard way of describing data structures and is useful for documentation, validation, or integration with other tools.
```python
# Get the JSON schema for datasets
schema = dv.json_schema()

# This schema describes the structure of all metadata blocks
# You can use it for validation, documentation, or tooling
```

This is primarily useful for advanced users who need to work with JSON schemas or integrate Dataverse datasets with other systems that understand JSON schema.
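As an example of what a schema enables, even a hand-rolled check of required keys catches missing metadata early. The schema fragment below is hypothetical and greatly simplified; real-world code would typically feed the full schema to a dedicated validator such as the `jsonschema` package:

```python
# A tiny, hypothetical fragment of a dataset schema
schema = {
    "type": "object",
    "required": ["title", "authors"],
}

def missing_required(document: dict, schema: dict) -> list:
    """Return the required keys that are absent from the document."""
    return [key for key in schema.get("required", []) if key not in document]

print(missing_required({"title": "My Research Dataset"}, schema))       # ['authors']
print(missing_required({"title": "T", "authors": ["Jane"]}, schema))    # []
```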
Related Classes
The Dataverse class works closely with several other classes in the pyDataverse library:
- `Dataset`: Represents a Dataverse dataset with metadata blocks and files. This is what you get when you create or fetch a dataset. It provides methods for adding files, modifying metadata, and uploading to the server.
- `Collection`: Represents a Dataverse collection (sub-dataverse). Collections organize datasets into groups. You can browse datasets within a collection, create new datasets, and manage collection metadata.
- `File`: Represents a file within a dataset. Files are the actual data files (such as CSV files, images, and code) stored in datasets. The `File` class provides methods for reading, downloading, and managing files.
- `NativeApi`: Low-level API client for direct Dataverse API calls. This is what the `Dataverse` class uses internally, but you can also use it directly if you need more control or want to perform operations that are not yet wrapped in high-level methods.
See Also
- Dataset Documentation: Learn more about working with datasets, including adding files, modifying metadata, and uploading content to the server.
- Collection Documentation: Learn more about collections, including browsing content and managing collection metadata.
- Native API Documentation: Detailed reference for advanced users who need direct API access.