Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.6 and DVC 2.30. Both commands are in development, and the similarities and differences may change over time.
Similarities
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC.
Xvc has the same optional and recommended reliance on Git, but all of its features are available without Git. Xvc invokes Git through its command-line interface, as a user would, without relying on any particular Git library.
Both commands hash file content to detect changes in files.
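As a minimal sketch of content-based change detection, the following Python snippet compares a file's current digest with a previously recorded one. It uses SHA-256 from the standard library; the function names `content_digest` and `has_changed` are hypothetical and do not come from either tool.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def has_changed(path: Path, recorded_digest: str) -> bool:
    """A file counts as changed when its digest differs from the recorded one."""
    return content_digest(path) != recorded_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"
        data.write_text("a,b\n1,2\n")
        recorded = content_digest(data)
        print(has_changed(data, recorded))  # False: content is identical
        data.write_text("a,b\n1,3\n")
        print(has_changed(data, recorded))  # True: content differs
```

Comparing digests rather than timestamps means a file that is rewritten with identical content is not reported as changed.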
Both use directed acyclic graphs (DAGs) to represent pipelines.
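To illustrate why a DAG is a natural fit for pipelines, here is a small Python sketch that derives an execution order from step dependencies with the standard library's `graphlib`. The pipeline and its step names are made up for the example; neither tool is claimed to work this way internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"evaluate", "preprocess"},
}


def run_order(steps):
    """Return an order in which every step comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    print(run_order(pipeline))
    try:
        run_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected: not a valid pipeline")
```

A dependency cycle has no valid execution order, which is why the graph must stay acyclic.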
Conceptual Differences
stage vs. step: What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word with a different meaning increases the mental effort needed to describe and understand pipelines.
remote vs. storage: What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
pipeline definitions: In DVC, there is a 1-1 correspondence between dvc.yaml files in a repository and the pipelines. When you want to create a new pipeline in DVC, you create a new file. In Xvc, pipelines are abstract. They are defined with the xvc pipeline family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
Files in the user workspace: DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with dvc add, DVC creates a .dvc file next to it. Xvc only creates a .xvc/ directory in the repository root and only updates .gitignore files to hide tracked files from Git. You won't see any files added next to your data files.
cache-type vs. recheck-method: In DVC, the cache type (or rather the recheck method), that is, whether a file in the repository is linked to its cached version by copy, reflink, symlink or hardlink, is determined repository-wide. All your cache links are symlinks, or all are hardlinks, and so on. Xvc tracks this per file: one file can be symlinked to the cache while another is copied from it.
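The per-file idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the `recheck` helper, the cache layout and the file names are hypothetical, and reflink is left out because the Python standard library has no portable call for it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, method: str) -> None:
    """Place `cached` at `target` using the given method (hypothetical helper)."""
    if target.exists() or target.is_symlink():
        target.unlink()
    if method == "copy":
        shutil.copy2(cached, target)
    elif method == "symlink":
        os.symlink(cached, target)
    elif method == "hardlink":
        os.link(cached, target)
    else:
        raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        cache.mkdir()
        (cache / "a").write_text("first")
        (cache / "b").write_text("second")
        # Each file carries its own method instead of one repository-wide setting.
        methods = {"a": "symlink", "b": "copy"}
        for name, method in methods.items():
            recheck(cache / name, root / name, method)
        print((root / "a").is_symlink())  # True
        print((root / "b").is_symlink())  # False
```

Keeping the method next to each file lets large read-only inputs stay as links while files that need local edits are copied.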
Command Differences
While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both git push and dvc push commands may look beneficial for understanding at first, as these two are analogous. However, giving the same name also hides important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
dvc add can be replaced by xvc file track. dvc add creates a .dvc file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
Instead of deleting .dvc files to remove a file from DVC, you can use xvc file untrack. It can also restore all versions of an untracked file to a directory.
dvc check-ignore can be replaced by xvc check-ignore. The Xvc version can be used against any other ignore filename (.gitignore, .ignore, .fooignore, ...).
dvc checkout is replaced by xvc file recheck. There is a --recheck-method (shortened as --as) option in several Xvc commands to tell whether to check out as symlink, hardlink, reflink or copy.
dvc commit is replaced by xvc file carry-in. Both cache files that have changed.
There is no command similar to dvc config. You can either edit the configuration files, or modify configuration with -c options in each run. You can also supply all configuration from the environment. See Configuration.
dvc dag is replaced by xvc pipeline dag. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art; instead, it provides either a Graphviz representation or a Mermaid diagram.
dvc data status and dvc status can be replaced by xvc file list. The Xvc version doesn't provide information about pipelines or remote storages.
There is no command similar to dvc destroy in Xvc. There will be an xvc deinit command at some point. Until then, you can just delete the .xvc/ directory and all .xvcignore files in your repository to remove Xvc from it.
There is no command similar to dvc diff in Xvc.
There is no command similar to dvc doctor or dvc version. Version information should be visible in the help text. Unless compiled from source with feature flags, Xvc binaries don't have feature differences.
Currently, there are no commands corresponding to the dvc exp set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
dvc fetch is replaced by xvc file bring --no-recheck.
Instead of freezing "pipeline stages" as in dvc freeze, and unfreezing with dvc unfreeze, xvc pipeline step update --changed [never|always|by_dependencies] can be used to specify if/when to run a pipeline step.
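One possible reading of the three values can be sketched in Python as below. The semantics are an assumption inferred from the option names (never run, always run, or run only when dependencies changed); the `Changed` enum and `should_run` function are hypothetical and are not taken from Xvc's source.

```python
from enum import Enum


class Changed(Enum):
    """Hypothetical mirror of the never|always|by_dependencies values."""
    NEVER = "never"
    ALWAYS = "always"
    BY_DEPENDENCIES = "by_dependencies"


def should_run(policy: Changed, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, under the assumed semantics."""
    if policy is Changed.ALWAYS:
        return True
    if policy is Changed.NEVER:
        return False
    # BY_DEPENDENCIES: run only when at least one dependency has changed.
    return dependencies_changed


if __name__ == "__main__":
    for policy in Changed:
        print(policy.value, should_run(policy, dependencies_changed=False))
```

Under this reading, "never" plays the role of a frozen stage, while "by_dependencies" is the ordinary behavior of an unfrozen one.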
Instead of dvc gc to "garbage-collect" files, you can use xvc file remove with various options.
There is no corresponding command for dvc get-url in Xvc. You can use wget or curl instead.
Currently, there is no command to replace dvc get, dvc import, and dvc import-url. URL dependencies are supported in the pipeline with xvc pipeline step dependency --url.
Instead of relying on Git hooks like those set up by dvc install, Xvc issues Git commands itself if the git.auto_commit and git.auto_stage configuration options are set.
There is no corresponding command for dvc list-url.
dvc list is replaced by xvc file list for local paths. Its remote capabilities are not implemented but are on the roadmap.
Xvc doesn't mix files from different repositories in the same storage. There is an ID for each Xvc repo that's also used in remote storage paths.
Currently, there is no params/metrics tracking/diff similar to dvc params, dvc metrics or dvc plots commands in Xvc.
dvc move is replaced by xvc file move.
dvc push is replaced by xvc file send.
dvc pull is replaced by xvc file bring.
There are no commands similar to dvc queue for experiments in Xvc. Experiment tracking will probably be handled differently.
The dvc remote set of commands is replaced by the xvc storage set of commands. You can use xvc storage new for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of dvc remote modify, you can use xvc storage remove and xvc storage new.
There is no single command to replace dvc remove. For files, you can use xvc file delete. For pipeline steps, you can use xvc pipeline step remove.
Instead of dvc repro, Xvc has xvc pipeline run. If you want to reproduce a pipeline, you can use xvc pipeline run again.
xvc root is for the same purpose as dvc root.
dvc run (which defines a stage in a DVC pipeline and immediately runs it) can be replaced by the xvc pipeline set of commands: xvc pipeline new for a new pipeline, xvc pipeline step new for a new step in the pipeline, xvc pipeline step dependency to specify dependencies of a step, xvc pipeline step output to specify outputs of a step, and xvc pipeline run to run the pipeline.
Instead of dvc stage add, we have xvc pipeline step new. For dvc stage list, we have xvc pipeline step list.
There is no need for dvc protect or dvc unprotect commands in Xvc.
"Cache type" of Xvc is not a repository-wide option, and is called "recheck method". If you want to track a certain directory as symlink, and another as hardlink, you can do so with xvc file recheck --as.
If you want identical files copied to one directory and linked in another, xvc file copy can help.
DVC needs dvc update for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL/file has changed.
DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
Extra Features of Xvc
Xvc can use multiple hashing functions, like BLAKE3, BLAKE2s, SHA2-256 and SHA3-256. More can be added upon request. The only requirement for hashes is having 32 bytes (256 bits, or 64 hex digits) of output.
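The output-size requirement can be checked with a short Python sketch. It covers only the three algorithms available in Python's standard `hashlib` (BLAKE3 is not part of the standard library and is omitted); the `digest_sizes` helper is hypothetical.

```python
import hashlib

# Constructors from the standard library for three of the listed algorithms.
algorithms = {
    "BLAKE2s": hashlib.blake2s,
    "SHA2-256": hashlib.sha256,
    "SHA3-256": hashlib.sha3_256,
}


def digest_sizes(data: bytes) -> dict:
    """Map each algorithm name to the length of its digest in bytes."""
    return {name: len(ctor(data).digest()) for name, ctor in algorithms.items()}


if __name__ == "__main__":
    for name, size in digest_sizes(b"example content").items():
        print(f"{name}: {size} bytes = {size * 8} bits")
```

Each of these produces a 32-byte digest, so they can share one fixed-size slot wherever a content hash is stored.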
In its pipelines, Xvc has more flexibility in defining dependencies. DVC supports files, directories and hyperparameters. Xvc additionally supports:
- globs,
- text file lines defined by line numbers,
- text file lines defined by regular expressions,
- URLs,
- SQLite queries.
Technical Differences
- DVC is written in Python. Xvc is written in Rust.
- DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
- DVC tracks file/directory changes in separate .dvc files. Xvc tracks them in .json files in .xvc/store. There is no 1-1 correspondence between these files and the directory structure.
- DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (xvc-ecs) in its core.
- DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This provides inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated, and when you want to delete all files associated with a repository, you can do so without the risk of deleting files used in other repositories.
- DVC considers directories as file-equivalent entities to track with .dvc files pointing to .json files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
- DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
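The last point, running Git as a separate process instead of calling a Git library, can be sketched in Python like this. Xvc itself is written in Rust, so this is only an illustration of the approach; the helper names `git_command` and `run_git` are hypothetical. The only Git option used is `-C`, which makes Git run as if started in the given directory.

```python
import subprocess


def git_command(repo_root: str, *args: str) -> list:
    """Build the argument list for a Git invocation rooted at `repo_root`."""
    return ["git", "-C", repo_root, *args]


def run_git(repo_root: str, *args: str) -> subprocess.CompletedProcess:
    """Run Git as a child process and capture its output as text.

    Requires a `git` executable on PATH; no Git library is involved.
    """
    return subprocess.run(
        git_command(repo_root, *args),
        capture_output=True,
        text=True,
        check=False,
    )
```

For example, `run_git(".", "status", "--short")` would return the same output a user sees when typing that command, which is the main appeal of this approach: behavior follows the installed Git rather than a library's reimplementation.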