Dataset
This article introduces the usage of the Bohrium dataset.
What is Dataset?
The Bohrium dataset provides capabilities for data import, download, data version management, data sharing, and dataset mounting. Have you ever run into any of the following problems?
- Most of the input files for your jobs are the same, but every time you submit a job you have to wait for the files to be packaged and uploaded, which slows down job submission.
- The input files are large, so packaging and uploading them takes a long time when you submit a job.
- You have files that you want to share with others, but you do not know how to share them.
- ...
Datasets address these problems: they improve job submission efficiency and meet data sharing needs.
Creating a Dataset
Creating a dataset on the web interface
Click the "Dataset" button in the main menu to enter the dataset list page, as shown in Figure 1. Click the "New Dataset" button, as shown in Figure 2, to enter the dataset creation page.
Fill in the basic information of the dataset and upload files. For the meaning of each field, see the "Dataset content description" section at the end of this article.
After the information and files are prepared, click "Create". The system creates the v1 version of the dataset by default, and the page automatically redirects to the details page of this dataset version.
Creating a dataset using the command-line tool
When the dataset files are large, the transfer can take a long time, and the creation process might fail due to network issues or other factors. In that case, you can use the Bohrium CLI, which supports resuming from breakpoints, to create the dataset.
If an interruption occurs due to network issues or other factors, you can resume by re-executing the same command. Then follow the prompt and enter y to recover the previous files, allowing the process to resume from the breakpoint.
Summary:
Flags:
-m, --comment string dataset description
-h, --help help for create
-l, --lp string file local path
-n, --name string dataset name
-p, --path string dataset path
-i, --pid int project id
Parameter description:
Parameter | Abbreviation | Description | Required |
---|---|---|---|
--comment | -m | Dataset description | No |
--name | -n | Dataset name | Yes |
--path | -p | Dataset path | Yes |
--pid | -i | Project ID | Yes |
--lp | -l | Local file path | Yes |
Example:
$ bohr dataset create -n bigfile -p bigfile -i 26611 -l "/Users/dp/Downloads/test"
# Upload the test file to the bigfile dataset.
# Interrupt the creation during the upload process.
# Re-enter the same command and input ‘y’ to continue the upload.
Viewing a Dataset
Click the "Dataset" button in the main menu to enter the dataset list page. The list displays all the datasets you can use, including the datasets you created and the datasets others have created and shared with you.
Click on the dataset name to enter the dataset details page, where you can view the basic information of the dataset and the information of each version, obtain the file paths of each version, view and download version files.
You can also use the Bohrium CLI tool to view the dataset.
$ bohr dataset list # View all datasets (Press Ctrl+C to exit)
Editing and Version Management of a Dataset
If you have the management permissions for the dataset, you can perform operations such as adding new versions, deleting, and editing the basic information of the created dataset.
Version Management
If you need to make changes to the files in the current dataset, you can release a new version by using the "Create New Version" method.
Creating: Click the "New Version" button to enter the new dataset version creation page. The system will automatically import the existing files from the latest published version. You can add or delete files as needed, and click "Create" to release the new version.
Waiting for preparation to complete: Creating a new version requires some preparation time. During the preparation, other users cannot see or use this version. The duration of the preparation time is related to the number and size of version files. Please wait for the version to be ready before using it.
After a version is created, the files within the version cannot be changed. If adjustments are needed, you can create a new version.
All the versions you have released will be displayed in the dataset. You can add and delete dataset versions according to your actual needs. Other users can only see the versions that you have successfully published.
Notice: Deleted versions cannot be restored and will no longer be viewable or usable.
Editing a Dataset
Click the "Edit" button on the dataset list page or dataset details page to modify the dataset's name, description, and permission scope.
In the dataset details page, you can also modify the description of each version.
Using a Dataset
The dataset is currently supported in the following scenarios:
Submitting a job
- Command Line Submission: You only need to modify your job.json, adding a dataset_path field. In this field, fill in the paths of the dataset versions you need to use, in array format, as shown in the red box in the image below.
When submitting a job, the method of specifying the input file directory is still supported, and both can be used simultaneously.
Here is an example of how to fill in job.json:
{
"job_name": "DeePMD-kit test",
"command": " cd se_e2_a && dp train input.json > tmp_log 2>&1 && dp freeze -o graph.pb",
"log_file": "se_e2_a/tmp_log",
"backward_files": ["se_e2_a/lcurve.out","se_e2_a/graph.pb"],
"project_id": 0000,
"platform": "ali",
"machine_type": "c4_m15_1 * NVIDIA T4",
"job_type": "container",
"image_address": "registry.dp.tech/dptech/deepmd-kit:2.1.5-cuda11.6",
"dataset_path": ["/bohr/test1-51ov/v1","/bohr/test1-51ov/v2"]
}
- Web submission:
When submitting a job on the graphical interface, click the 'Select Dataset' button and choose the version of dataset you wish to use.
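For command-line submission, the dataset_path field can also be filled in by a script instead of by hand. The following is a minimal sketch, not part of the Bohrium tooling: add_datasets is a hypothetical helper, and the job dictionary is a trimmed-down copy of the example above. It only relies on dataset_path being an array of dataset version paths, as described in this section.

```python
import json


def add_datasets(job: dict, dataset_paths: list[str]) -> dict:
    """Return a copy of job whose dataset_path also lists the given version paths."""
    updated = dict(job)
    existing = list(updated.get("dataset_path", []))
    # Keep the original order and skip paths that are already listed.
    for path in dataset_paths:
        if path not in existing:
            existing.append(path)
    updated["dataset_path"] = existing
    return updated


# Trimmed-down job description; the values mirror the example above.
job = {
    "job_name": "DeePMD-kit test",
    "log_file": "se_e2_a/tmp_log",
    "dataset_path": ["/bohr/test1-51ov/v1"],
}

job = add_datasets(job, ["/bohr/test1-51ov/v1", "/bohr/test1-51ov/v2"])

# Print the result; in practice you would save this text as job.json.
print(json.dumps(job, indent=2))
```

The helper returns a new dictionary rather than modifying the one passed in, so the original job description stays available if you want to generate several variants with different dataset versions.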
Use and share datasets in Notebook
When writing and posting a notebook, you can use and share the datasets required for the notebook along with it.
Step 1: Select the dataset you want to use/share
On the Bohrium homepage, click the 'New - Notebook' button at the top left corner to enter the Notebook editing page.
Click the arrow on the right side to expand the extension panel.
Click on "Select Existing Datasets" to add the dataset version required for this notebook. You can also click on "New Dataset" to create a new dataset.
Notice: Please add the dataset before connecting to the node. Datasets added after the node has started will require a node restart to take effect.
Step 2: Use the dataset in the Notebook
Move the mouse over the selected dataset name and click the copy button to get the storage path of the dataset files. All dataset files are stored in this path.
Simply enter this path in the Notebook to use it. The examples below use the path /bohr/testdataset-6xwt/v1/:
Example 1: Enter the dataset directory
cd /bohr/testdataset-6xwt/v1/
Example 2: List all files under the dataset
ls /bohr/testdataset-6xwt/v1/
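The same listing can be done from a Python cell in the Notebook. This is a minimal sketch using only the standard library; list_dataset_files is a hypothetical helper name, and the path is the example path used above, so replace it with the path you copied from your own dataset.

```python
from pathlib import Path

# Example path from this article; replace it with the path copied from your dataset.
DATASET_DIR = Path("/bohr/testdataset-6xwt/v1/")


def list_dataset_files(root: Path) -> list[str]:
    """Return the relative paths of all regular files under root, sorted."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if DATASET_DIR.is_dir():
    for name in list_dataset_files(DATASET_DIR):
        print(name)
else:
    # A dataset added after the node started is not visible until the node restarts.
    print(f"{DATASET_DIR} is not available on this node")
```

Checking that the directory exists first gives a clear message when the dataset was added after the node had already started, instead of an empty listing.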
Step 3: Post the Notebook and share the dataset
After the Notebook with the added dataset is posted, other users can view and use the corresponding dataset on the details page.
Use the dataset on the management node
When you create a container management node, you can add the version of the dataset you need to mount, as shown in the figure below as '1'. After successful mounting and booting, you can find the dataset files on the management node at the path shown as '2'.
Dataset content description
Field Name | Description | Example |
---|---|---|
Dataset Name | The name of the dataset, which can be modified at any time | testdataset |
Dataset Path | The path to which the dataset files are uploaded. Enter a recognizable identifier for the dataset in the input box, and the system automatically generates a unique path for the version. Notice: Modifying the path after uploading files clears the files you have already uploaded, so modify it with caution. | /bohr/testdataset-b2dh/v1 |
Files | The files included in this dataset version. Local files or folders can be uploaded. Notice: Do not refresh or leave the page during file upload, to avoid upload failure. | -- |
Project | The project to which the dataset belongs. Project members can use the dataset by default. | testproject |
Permissions | Manageable: permission to edit, delete, and create new versions of the dataset. The dataset creator and the creator and administrator of the project to which the dataset belongs have this permission by default, and it cannot be changed. Usable: permission to view and use the dataset. Members of the project to which the dataset belongs have this permission by default, and it cannot be changed. This permission can also be granted to other projects or users. | Manageable: the dataset creator and the creator and administrator of the project to which the dataset belongs. Usable: members of the project to which the dataset belongs |
Description | The description of the dataset | This dataset is used for testing |
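
The example paths in this article all have the shape /bohr/<dataset directory>/v<version number>. Assuming that shape holds in general (it is inferred from the examples here, not a documented guarantee), a small check like the one below can catch typos in a path before it is used in job.json or in a Notebook. The name split_version_path and the regular expression are illustrative assumptions.

```python
import re

# Assumed shape, inferred from the example paths: /bohr/<dataset directory>/v<number>
_VERSION_PATH = re.compile(r"^/bohr/(?P<dataset>[^/]+)/v(?P<version>\d+)/?$")


def split_version_path(path: str) -> tuple[str, int]:
    """Split a dataset version path into (dataset directory, version number)."""
    match = _VERSION_PATH.match(path)
    if match is None:
        raise ValueError(f"not a dataset version path: {path!r}")
    return match.group("dataset"), int(match.group("version"))


print(split_version_path("/bohr/testdataset-b2dh/v1"))
```

A trailing slash is accepted because the Notebook examples above use one; anything else that does not match the assumed shape raises ValueError.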