Introduction to Xvc
Xvc is a command line utility to track large files with Git, define dependencies between files so that commands run only when these dependencies change, and run experiments by making small changes in these files for later comparison. It's used mostly in Machine Learning scenarios where data and model files are large, code files depend on them, and experiments must be compared via various metrics.
Xvc can upload tracked files, with their exact versions, to S3 and compatible cloud storages and retrieve them later. This lets you delete files from the project to save space when they are not needed and get them back when they are. The same facility can be used for sharing: you can clone the Git repository and get only the Xvc-tracked files you need.
Xvc tracks files, directories and other elements by calculating their digests. These digests are used as address to store and find their locations in the storages. When you make a change to a file, it gets a new digest and the changed version has a new address. This makes sure that all versions can be retrieved on demand.
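The idea of using a digest as an address can be illustrated with a small sketch. It is not Xvc's implementation: SHA-256 is used only because it ships with Python, and the `cache_path` layout (first two hex digits as a directory) is a hypothetical choice made for the example.

```python
import hashlib
from pathlib import Path


def content_digest(data: bytes) -> str:
    """Return a hex digest of the content (SHA-256 here, as an assumption)."""
    return hashlib.sha256(data).hexdigest()


def cache_path(cache_root: Path, digest: str) -> Path:
    """Map a digest to a storage location; this directory layout is hypothetical."""
    return cache_root / digest[:2] / digest[2:]


root = Path(".cache")
v1 = content_digest(b"cat picture, first version")
v2 = content_digest(b"cat picture, edited version")

# Different content gives a different digest, hence a different address,
# so both versions can live side by side in the same storage.
print(cache_path(root, v1))
print(cache_path(root, v2))
```

Because the address is derived from the content alone, storing a new version never overwrites an old one.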
Xvc can be used as a `make` replacement to build multi-file projects with complex dependencies. Unlike `make`, which detects file changes with timestamps, Xvc checks files via their content. This reduces false positives in invalidation.
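The difference between the two change-detection strategies can be seen in a minimal sketch. The file name `main.c` and the use of SHA-256 are assumptions for illustration; the point is only that touching a file alters its timestamp while its content digest stays the same.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def digest_of(path: Path) -> str:
    """Hex digest of the file content (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "main.c"
    source.write_text("int main(void) { return 0; }\n")
    mtime_before = source.stat().st_mtime_ns
    digest_before = digest_of(source)

    # Simulate `touch`: move the modification time forward by one second
    # without changing a single byte of the content.
    later = mtime_before + 1_000_000_000
    os.utime(source, ns=(later, later))

    timestamp_changed = source.stat().st_mtime_ns != mtime_before
    content_changed = digest_of(source) != digest_before

print("timestamp changed:", timestamp_changed)
print("content changed:", content_changed)
```

A timestamp-based tool would rebuild here; a content-based one would not.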
Xvc pipelines define the steps needed to reach a set of outputs. These steps have commands to run and may (or may not) produce intermediate outputs that other steps depend on. As of now, Xvc pipelines allow steps to depend on other steps, other pipelines, text and binary files, directories, globs that select a subset of files, certain lines in a file, certain regular expression results, URLs, and (hyper)parameter definitions in YAML, JSON or TOML files. More dependency types like environment variables, database tables and queries, S3 buckets, REST query results, generic CLI command results, Bitcoin wallets, and Jupyter notebook cells are planned.
For example, Xvc can be used to create a pipeline that depends on certain files in a directory via a glob, and on a parameter in a YAML file, to update a machine learning model. The same feature can be used to build software when the code or artifacts used in the software change. This allows binary outputs (as well as code inputs) to be tracked in Xvc. Instead of building everything from scratch in a new Git clone, a software project can rebuild only the portions that require it. Binary distributions become much simpler.
This book is the documentation of the project. Like Xvc itself, it is a work in progress and may contain outdated information. Please report any errors and bugs at https://github.com/iesahin/xvc, as you would for the rest of the project.
Comparison with other tools
There are many similar tools for managing large files on Git and for managing machine learning pipelines and experiments. Most ML-oriented tools are provided as SaaS and work in a different vein than Xvc.
Similar tools for file management on Git are the following:
- dvc: See Xvc for DVC Users and Benchmarks against DVC documents for a detailed comparison.
- git-annex: One of the earliest and most successful projects to manage large files on Git. It supports a large number of remote storage types, as well as adding other utilities as backends, similar to `xvc storage new generic`. It features an assistant aimed at making common use cases easier. It uses SHA-256 as the single digest option and uses symlinks as a recheck method. It doesn't have data pipeline features.
- git-lfs: It uses Git internals to track binary files. It requires server support for remote storages and allows only Git remotes to be used for binary file storage. It uses the same digest function Git uses (by default, SHA-1). It uses the `.gitattributes` mechanism to track certain files by default. It doesn't have data pipeline features.
Installation
Rust
Linux
macOS
Windows
Compiling Xvc without default features
You may want to customize the feature set when you want a smaller binary. Not everyone needs all storage options, and turning them off may result in a smaller binary.
When you turn off all remote storage features, the async runtime (`tokio`) is also excluded from the binary.
cargo build --no-default-features --release
[..]
Finished `release` profile [optimized] target(s) in 4.65s
Compiling Xvc without Reflink support
The `reflink` crate may cause compilation errors on platforms where it's not supported. Xvc adds a `reflink` feature flag that's turned on by default. When `reflink` causes errors, you can turn off default features and select only those you'll use.
cargo build --no-default-features --features "reflink" --release
[..]
Finished `release` profile [optimized + debuginfo] target(s) in 56.40s
Note that when you supply `--no-default-features`, all other default features like `s3` etc. are also turned off. You'll have to specify which features you want in the features list. Otherwise Xvc cannot connect to your storages.
cargo build --no-default-features --features "s3,wasabi" --release
[..]
Finished `release` profile [optimized + debuginfo] target(s) in 56.40s
Configuration
Configuration Files
Configure with Environment Variables
Changing configuration for a command
Getting Started with Xvc
Xvc is a multipurpose tool. Its features can be used by professionals with various roles. If you're working with data, you can benefit from Xvc data management features.
Xvc for Everyone
Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).
🐇 Hello tortoise. How are you? Let's take a selfie. Do you take selfies? I have lots of them. Terabytes of them.
🐢 I don't have many selfies, you know. I don't change quickly, and the scenery changes even less often.
🐇 I see. I have terabytes of them, but can't find a good solution to store them. How do you store your documents? I know you have documents, lots and lots of them.
🐢 I track them with Git to track my evolving thoughts on text files. Images are different. I think it's not a good idea to keep images on Git, but there is a tool for that.
🐇 What kind of tool? Not Git, but something different?
🐢 It's called Xvc. You can keep track of your selfies with it. You can back them up, and get them as needed.
🐇 Tell me more about it. I have a directory in my home, `~/Selfies`, and I have thousands of them. How will I start?
🐢 Xvc can be used as a standalone tool but works better with Git. You can just type
$ git init
$ xvc init
to start working with Xvc.
🐇 It looks easy but I heard that Git is complicated. Will I need to learn it?
🐢 Ah, no. If you're not willing to learn Git, you can just let Xvc handle that. By default, it handles all Git operations about the changes it makes. If you want to share your files with someone, you may need to learn how to manage a repository.
🐇 How do I track my files?
🐢 You use the `xvc file track` command. Do you have directories in `~/Selfies`?
🐇 Yep. I have. Lots of them.
🐢 Do you want to track all of them?
🐇 Almost all. Some of them are so private that I want to hide them even from Xvc.
🐢 You can use a `.xvcignore` file to list them. Xvc ignores the files you list in `.xvcignore`.
🐇 How do I add others? Could you give an example?
🐢 If you have a folder for today's selfies, type this in ~/Selfies
$ xvc file track today/
and Xvc will track everything in that directory.
🐇 Oh, that's easy. If I want to track everything not ignored, I can type `xvc file track` then.
🐢 You're a quick learner.
After a brief period, 🐇 went home and added the files.
🐇 Now, I want to learn how to share my selfies.
🐢 Xvc can store file contents in another location. First you must set up a storage. Do you use AWS S3?
🐇 Yes. I have buckets there. I want to keep my selfies in my `rabbit-hole`.
🐢 You can configure Xvc to use it with the `xvc storage new s3` command. You'll specify the region and bucket, and Xvc will prepare it.
🐇 types
$ xvc storage new s3 --name selfies --region eu-lepus-1 --bucket rabbit-hole
🐢 Now, you can send your files there with `xvc file send --to selfies`.
🐇 Is that all?
🐢 You will also need to push your Git files to another place. Do you have a GitHub account?
🐇 Ah, yeah, I have.
🐢 Now create a repository for your selfies. We will configure Git to use it as `origin`.
$ git remote add origin https://github.com/🐇/selfies
$ git push --set-upstream origin main
Now, you can share your selfies with your friends.
🐇 Cool, but how does Xvc know my AWS password? Does it share my passwords?
🐢 No, never. You must allow your friends to read that bucket of yours. Xvc reads the credentials from AWS configuration, either from the file or the environment variables.
🐇 How will they get my files?
🐢 First, they must clone the repository.
$ git clone https://github.com/🐇/selfies
Then, they can get all files with:
$ cd selfies
$ xvc file bring .
🐇 Oh, cool, they don't have to `xvc init` again? Right?
🐢 No, they don't. Xvc should be initialized only once per repository. When you have new selfies, you can share them with:
$ xvc file track
$ git push
and your friends can receive the changes with
$ git pull
$ xvc file bring
🐇 The order of these commands is important, it seems.
🐢 Yep. You add to Xvc first. Xvc automatically commits the changes to Git. Then you push Git changes to remote. Your friends first pull these changes, then get the actual files.
🐇 Thank you tortoise. Let me get back to my hole.
Xvc for Data
Xvc for Machine Learning
Xvc Getting Started pages are written as stories and dialogues between tortoise (🐢) and hare (🐇).
🐇 Ah, hello tortoise. How are you? I began to work as a machine learning engineer, you know? I'll be the fastest.
🐢 You're quick as always, hare. How is your job going so far?
🐇 It's good. We have lots and lots of data. We have models. We have scripts to create those models. We have notebooks full of experiments. That's all good stuff. We'll solve the hare intelligence problem.
🐢 Sounds cool. Aren't you losing yourself in all these, though?
🐇 From time to time we have those moments. Some models work with some data, some experiments require some kind of preprocessing, some data changed since we started to work with it and now we have multiple versions.
🐢 I see. I began to use a tool called Xvc. It may be of use to you.
🐇 What does it do?
🐢 It keeps track of all the stuff you mentioned. Data, models, scripts. It can also detect when data has changed and run the scripts associated with that data.
🐇 That sounds like something we need. My boss wanted me to build a pipeline for cat pictures. He runs a contest for cat pictures. Every time he finds a new cat picture he likes, we have to update the model.
🐢 He must have lots of cat pictures.
🐇 He has. He sometimes finds higher resolution versions and replaces older pictures. He has terabytes of cat pictures.
🐢 How do you keep track of those versions?
🐇 We don't. We have a disk for cat pictures. He puts everything there and we train models with it.
🐢 You can use Xvc to version those files. You can go back and forth in time, or have different branches. It's based on Git.
🐇 I know, but Git is for code files, right? I never found a good way to store image files in Git. It stores everything.
🐢 Yep. Git keeps all history in each repository. Better to keep those terabytes of images away from Git. Otherwise, you'll have terabytes of cat pictures in each clone you use. Xvc helps there. It tracks the contents of data files separately from Git. Image files are not put into Git objects, and they are not duplicated in all repositories.
🐇 You know, I'm not interested in details. Tell me how this works.
🐢 Ok. When you go back to cat picture directory, create a Git repository, and initialize Xvc immediately.
$ git init
...
$ xvc init
? 0
🐇 No messages?
🐢 Xvc is the silent type of Unix command. It follows the "no news is good news" principle. We use `? 0` to indicate the command's return code; 0 means success. If you want more output, you can add `-v` as a flag. Increase the number of `-v`s to increase the detail.
🐇 So `-vvvvvvvvvvvvvvv` will show which atoms interact on the disk while running Xvc?
🐢 It may work, try that next time. Now, you can add your cat pictures to Xvc. Xvc makes copies of tracked files by default. I assume you have a large collection. Better to make everything symlinks for now. We can change how specific files are linked to cache later.
$ xvc -v file track --as symlink .
🐇 Does it track everything that way?
🐢 Yes. If you want to track only particular files or directories, you can replace `.` with their names.
🐇 What's the best recheck method for me?
🐢 If your file system supports it, the best way seems to be `reflink` to me. It's like a symlink but makes a copy when your file changes. Most of the widely used file systems don't support it, though. If your files are read-only and you don't have many links to the same files, you can use `hardlink`. If they are likely to change, you can use `copy`. If there are many links to the same files, it's better to use `symlink`.
🐇 So, symlinks are not the best? Why did you select it?
🐢 I suspect most of the files in your cat pictures are duplicates. Xvc stores only one copy of these in cache and links all occurrences in the workspace to this copy. This is called deduplication. There are limits to number of hardlinks, so I recommended you to use symlinks. They are more visible. You can see they are links. Hardlinks are harder to detect.
🐇 Ah, when I type `ls -l`, they all show the cache location now.
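The deduplication the tortoise describes can be sketched roughly as follows. This is an illustration under assumptions, not Xvc's code: the flat cache layout, the SHA-256 digest, and the helper name `track_as_symlink` are all made up for the example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def track_as_symlink(workspace_file: Path, cache_dir: Path) -> Path:
    """Move content into the cache under its digest and leave a symlink behind."""
    digest = hashlib.sha256(workspace_file.read_bytes()).hexdigest()
    cached = cache_dir / digest
    if cached.exists():
        workspace_file.unlink()        # duplicate content: the cache already has it
    else:
        workspace_file.rename(cached)  # first occurrence: move it into the cache
    os.symlink(cached, workspace_file)
    return cached


names = ("cat-1.jpg", "cat-1-copy.jpg", "cat-2.jpg")
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    cache.mkdir()
    for name in names:
        content = b"tabby" if name.startswith("cat-1") else b"siamese"
        (root / name).write_bytes(content)

    targets = [track_as_symlink(root / name, cache) for name in names]
    cached_files = sorted(p.name for p in cache.iterdir())
    all_links = all((root / name).is_symlink() for name in names)

print(len(cached_files), "cached files for", len(names), "workspace files")
```

The two identical pictures end up pointing at one cached object, which is why a collection full of duplicates takes less space once tracked.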
🐢 If you have a `models/` directory and want to track them as copies, you can tell Xvc:
$ xvc file track --recheck-method copy models/
It replaces the previous symlinks with copies of the files only in `models/`.
🐇 Can I have my data read only and models writable?
🐢 You can. Xvc keeps track of each file's `recheck-method` separately. Data can stay in read-only symlinks, and models can be copied so they can be updated and stored as different versions.
🐇 I also have scripts. What should I do with them?
🐢 Are you using Git for them?
🐇 Yep. They are in a separate repository. I think I can use the same repository now.
🐢 You can. Better to keep them in the same repository. They can be versioned with the data they use and models they produce. You can use standard Git commands to track them. If you track a file with Git, Xvc doesn't track it. It stays away from it.
🐇 You said we can create pipelines with Xvc as well. I created a multi-stage pipeline for cat picture models. It's like this:
graph LR
    cats["data/cats/"] --> pp-train["preprocess.py --train data/pp-train/"]
    pp-train --> train["train.py"]
    params["params.yaml"] --> train
    cat-ratings["cat-ratings.txt"] --> train
    train --> model["models/model.bin"]
    cats --> pp-test["preprocess.py --test data/pp-test/"]
    model --> test["test.py"]
    pp-test --> test
    test --> metrics["metrics.json"]
    test --> best-model["best-model.json"]
    best-model --> deploy["deploy.sh"]
🐢 It looks like a fairly complex pipeline. You can create a pipeline definition for it. For each separate command we'll have a step. How many different commands do you have?
🐇 A `preprocess --train` command, a `preprocess --test` command, a `train` command, a `test` command and a `deploy` command. Five.
🐢 Do you need more than one pipeline? Maybe you would like to put deployment to another pipeline?
🐇 No, I don't think so. I may have in the future.
🐢 Xvc has a default pipeline. We'll use it for now. If you need more pipelines, you can create them with `xvc pipeline new`.
🐇 How do I create steps for the commands?
🐢 Let's create the steps at once. Each step requires a name and a command.
$ xvc pipeline step new --step-name preprocess-train --command 'python3 src/preprocess.py --train data/cats data/pp-train/'
$ xvc pipeline step new --step-name preprocess-test --command 'python3 src/preprocess.py --test data/cats data/pp-test/'
$ xvc pipeline step new --step-name train --command 'python3 src/train.py data/pp-train/'
$ xvc pipeline step new --step-name test --command 'python3 src/test.py data/pp-test/ metrics.json'
$ xvc pipeline step new --step-name deploy --command 'python3 deploy.py models/model.bin /var/server/files/model.bin'
🐇 How do we define dependencies?
🐢 You can have many different types of dependencies. All are defined by the `xvc pipeline step dependency` command. You can set up direct dependencies between steps: if one is invalidated, its dependents also run. You can set up file dependencies: if the file changes, the step is invalidated and needs to run. There are other, more detailed dependencies like parameter dependencies, which take a file in JSON or YAML format and check whether a value has changed. There are regular expression dependencies: for example, if you have a piece of code in your training script that you change to update the parameters, you can define a regex dependency.
🐇 It looks like I can use this for CSV files as well.
🐢 Yes. If your step depends not on the whole CSV file but only on specific rows, you can use regex dependencies. You can also specify line numbers of a file to depend on.
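What "depending on part of a file" means can be shown with a rough sketch. The helper names (`param_digest`, `regex_digest`, `lines_digest`), the sample data, and the use of JSON with SHA-256 are assumptions for illustration; Xvc's internals may differ. Each helper digests only the portion a step cares about, so edits elsewhere in the file leave the digest unchanged.

```python
import hashlib
import json
import re


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def param_digest(json_text: str, key: str) -> str:
    """Digest of a single top-level value in a JSON parameters file."""
    value = json.loads(json_text)[key]
    return digest(json.dumps(value, sort_keys=True))


def regex_digest(text: str, pattern: str) -> str:
    """Digest of only the lines that match the pattern."""
    matching = [line for line in text.splitlines() if re.search(pattern, line)]
    return digest("\n".join(matching))


def lines_digest(text: str, start: int, end: int) -> str:
    """Digest of lines start..end (zero-based, end exclusive)."""
    return digest("\n".join(text.splitlines()[start:end]))


params_old = '{"learning_rate": 0.01, "epochs": 10}'
params_new = '{"learning_rate": 0.01, "epochs": 20}'
ratings_old = "5,tabby.jpg\n3,grumpy.jpg\n"
ratings_new = "5,tabby.jpg\n2,grumpy.jpg\n"

# Neither edit touches the part the step depends on, so each comparison holds.
same_param = param_digest(params_old, "learning_rate") == param_digest(params_new, "learning_rate")
same_rows = regex_digest(ratings_old, r"^5,") == regex_digest(ratings_new, r"^5,")
same_lines = lines_digest(ratings_old, 0, 1) == lines_digest(ratings_new, 0, 1)
print(same_param, same_rows, same_lines)
```

A step that depended on the whole file would be invalidated by either edit; with the narrower dependency it is left alone.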
🐇 My `preprocess.py` script depends on the `data/cats` directory. My `train.py` script depends on `params.yaml` for some hyperparameters, and reads `5 Star` ratings from `cat-contest.txt`. I want to deploy when the newly produced model is better than the older one by checking `best-model.json`. My deployment script doesn't update the deployment if the new model is not the best.
🐢 Let's see. For each step, you can use a single command to define its dependencies. For `preprocess.py`, you'll depend on the data directory and the script itself. We want to run the step when the script changes. It's like this:
$ xvc pipeline step dependency --step-name preprocess-train --glob 'data/cats/*' --file src/preprocess.py
$ xvc pipeline step dependency --step-name preprocess-test --glob 'data/cats/*' --file src/preprocess.py
$ xvc pipeline step dependency --step-name train --glob 'data/pp-train/*' --file src/train.py --param 'params.yaml::learning_rate' --regex 'cat-contest.csv:/^5,.*'
$ xvc pipeline step dependency --step-name test --glob 'models/*' --glob 'data/pp-test/*'
$ xvc pipeline step dependency --step-name deploy --file best-model.json
You must also define the outputs these steps produce, so that when an output is missing or a dependency is newer than the output, the step is rerun.
$ xvc pipeline step output --step-name preprocess-train --directory data/pp-train
? 2
error: unexpected argument '--directory' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
$ xvc pipeline step output --step-name preprocess-test --directory data/pp-test
? 2
error: unexpected argument '--directory' found
Usage: xvc pipeline step output <--step-name <STEP_NAME>|--output-file <FILES>|--output-metric <METRICS>|--output-image <IMAGES>>
For more information, try '--help'.
$ xvc pipeline step output --step-name train --output-file models/model.bin
$ xvc pipeline step output --step-name test --output-file metrics.json --output-file best-model.json
$ xvc pipeline step output --step-name deploy --output-file /var/server/files/model.bin
🐇 These commands become too long to type. You know, I'm a lazy hare and don't like to type much. Is there an easier way?
🐢 You can try `source <(xvc aliases)` in your Bash or Zsh, and get a bunch of aliases for these commands. `xvc pipeline step output` becomes `xvcpso`, `xvc pipeline step dependency` becomes `xvcpsd`, etc. You can see the whole list:
$ xvc aliases
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'
🐇 Oh, there are many more commands.
🐢 Yep, and more to come. To edit a pipeline, you can use `xvc pipeline export`, and after making the changes, you can use `xvc pipeline import`.
🐇 I don't need to delete the pipeline to rewrite everything, then?
🐢 You can export a pipeline, edit and import with a different name to test. When you want to run them, you specify their names.
🐇 Ah, yeah, that's the most important part. How do I run?
🐢 `xvc pipeline run`, or `xvcpr`. It takes the name of the pipeline and runs it. It sorts the steps and checks if there are any cycles. The steps mustn't have cycles, otherwise it's an infinite loop, and computers don't like infinite loops like turtles do. Xvc runs steps in parallel if there are no common dependencies.
🐇 So, if I have multiple preprocessing steps that don't depend on each other, they can run in parallel?
🐢 Yeah, they run in parallel. For example, in your pipeline `preprocess-train` and `preprocess-test` can run in parallel, because they don't depend on each other.
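Sorting steps, rejecting cycles, and finding what can run side by side can be sketched with Python's standard `graphlib` module. This only illustrates the general technique, not how Xvc schedules steps, and the dependency edges below are one reading of the hare's diagram.

```python
from graphlib import CycleError, TopologicalSorter

# step -> set of steps it depends on (edges assumed from the diagram above)
dependencies = {
    "preprocess-train": set(),
    "preprocess-test": set(),
    "train": {"preprocess-train"},
    "test": {"train", "preprocess-test"},
    "deploy": {"test"},
}


def parallel_batches(graph):
    """Group steps into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def has_cycle(graph):
    """Report whether the step graph would loop forever."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


batches = parallel_batches(dependencies)
for batch in batches:
    print(batch)
```

Steps that land in the same batch have no path between them, so a runner is free to start them at the same time.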
🐇 Cool. I want to see the pipeline we created.
🐢 You can see it with `xvc pipeline dag` (`xvcpd`). It prints a mermaid.js diagram that you can paste into your files.
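To give a feel for what such a diagram is, here is a small sketch that turns a list of step edges into mermaid text. The function `to_mermaid` and the edge list are hypothetical; the actual output of `xvc pipeline dag` may be laid out differently.

```python
def to_mermaid(edges):
    """Render (source, target) pairs as a left-to-right mermaid graph."""
    lines = ["graph LR"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


# Edges assumed from the pipeline discussed in the dialogue.
edges = [
    ("preprocess-train", "train"),
    ("train", "test"),
    ("preprocess-test", "test"),
    ("test", "deploy"),
]
diagram = to_mermaid(edges)
print(diagram)
```

The resulting text can be pasted into any document that renders mermaid.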
🐇 Better to have an image of this, maybe.
🐢 I'll inform the developer about it. Please tell him anything you'd like to see in the tool on GitHub or via email. He's extremely introverted but tries to be a nice guy.
🐇 Ah, ok, I'll write to him about this.
Xvc for Software Development
Xvc for DVC Users
DVC is an MLOps utility to track data, pipelines and machine learning experiments on top of Git. Xvc is inspired by DVC in its purpose, but there are major technical differences between these two.
Note that this document refers mostly to Xvc v0.6 and DVC 2.30. Both commands are in development, and similarities and differences may change over time.
Similarities
The purposes of these two commands are similar, and they are alternatives to each other. Both aim to manage the data, pipelines and experiments of an ML project.
Both of the utilities similarly work on top of Git. DVC became more bound to Git after the introduction of its experiment tracking features. Before that, Git was optional (but recommended) for DVC.
Xvc has the same optional and recommended reliance on Git but all features are available without Git. Xvc uses Git with its CLI interface like a user, without any reliance on a particular library.
Both of these commands hash file content to detect changes in files.
Both of these use DAGs to represent pipelines.
Conceptual Differences
stage vs. step: What DVC calls "stage" in a data pipeline, Xvc calls "step." "Stage" has a different meaning in the Git context, and I believe using the same word in a different meaning increases the mental effort to describe and understand.
remote vs. storage: What DVC calls "remote", Xvc calls "storage." This is to emphasize the difference between Xvc storages and Git remotes.
pipeline definitions: In DVC, there is a 1-1 correspondence between `dvc.yaml` files in a repository and the pipelines. When you want to create a new pipeline, you create a new file in DVC. In Xvc, pipelines are abstract. They are defined with the `xvc pipeline` family of commands. No single file contains a pipeline definition. You can export pipelines to YAML, JSON, and TOML, and import them after making changes. Xvc doesn't consider any file format authoritative for pipelines, and their YAML/JSON/TOML representation may change between versions.
Files in the user workspace: DVC is more liberal in creating files among user files in the repository. When you add a file to DVC with `dvc add`, DVC creates a `.dvc` file next to it. Xvc only creates a `.xvc/` directory in the repository root and only updates `.gitignore` files to hide tracked files from Git. You won't see any files added next to your data files.
cache-type vs. recheck-method: In DVC, the cache type (or rather, recheck method), that is, whether a file in the repository is linked to its cached version by copying, reflink, symlink or hardlink, is determined repository-wide. You can have all your cache links as symlinks, or all as hardlinks, etc. Xvc tracks this per file: you can have one file symlinked to the cache, another copied from the cache, and so on.
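A per-file recheck method can be pictured with the following sketch. The `recheck` function, the file names, and the method table are assumptions for illustration, and reflink is left out because it needs file system support that the Python standard library does not expose portably.

```python
import os
import shutil
import tempfile
from pathlib import Path


def recheck(cached: Path, target: Path, method: str) -> None:
    """Place a cached file in the workspace using the given method."""
    if target.exists() or target.is_symlink():
        target.unlink()
    if method == "copy":
        shutil.copyfile(cached, target)
    elif method == "symlink":
        os.symlink(cached, target)
    elif method == "hardlink":
        os.link(cached, target)
    else:
        # reflink is deliberately not handled in this sketch
        raise ValueError(f"unsupported method: {method}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cached-object"
    cached.write_bytes(b"model weights")

    # Each workspace path carries its own method instead of one repository-wide setting.
    methods = {"data.bin": "symlink", "model.bin": "copy", "labels.bin": "hardlink"}
    for name, method in methods.items():
        recheck(cached, root / name, method)

    is_link = (root / "data.bin").is_symlink()
    is_copy = (not (root / "model.bin").is_symlink()
               and (root / "model.bin").stat().st_ino != cached.stat().st_ino
               and (root / "model.bin").read_bytes() == b"model weights")
    same_inode = (root / "labels.bin").stat().st_ino == cached.stat().st_ino

print(is_link, is_copy, same_inode)
```

Keeping the method next to each path is what lets read-only data stay linked while models remain writable copies.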
Command Differences
While naming Xvc commands, we tried our best to avoid name clashes with Git. Having both `git push` and `dvc push` commands may look beneficial for understanding at first, as these two are analogous. However, giving the same name also hides important details that are more difficult to emphasize later (e.g., DVC experiments are Git objects that are pushed to Git remotes, while the files changed during experiments are pushed to DVC remotes).
`dvc add` can be replaced by `xvc file track`. `dvc add` creates a `.dvc` file (formatted in YAML) in the repository. Xvc doesn't create separate files for tracked paths.
Instead of deleting `.dvc` files to remove a file from DVC, you can use `xvc file untrack`. It can also restore all versions of an untracked file to a directory.
`dvc check-ignore` can be replaced by `xvc check-ignore`. The Xvc version can be used against any other ignore filename (`.gitignore`, `.ignore`, `.fooignore`, ...).
`dvc checkout` is replaced by `xvc file recheck`. There is a `--recheck-method` (shortened as `--as`) option in several Xvc commands to tell whether to check out as symlink, hardlink, reflink or copy.
`dvc commit` is replaced by `xvc file carry-in`. They both cache the files if they are changed.
There is no command similar to `dvc config`. You can either edit the configuration files, or modify configuration with `-c` options in each run. You can also supply all configuration from the environment. See Configuration.
`dvc dag` is replaced by `xvc pipeline dag`. The DVC version uses ASCII art to present the pipeline. Xvc doesn't provide ASCII art; instead, it provides either a Graphviz representation or a mermaid diagram.
`dvc data status` and `dvc status` can be replaced by `xvc file list`. The Xvc version doesn't provide information about pipelines or remote storages.
There is no command similar to `dvc destroy` in Xvc. There will be an `xvc deinit` command at some point. Until then, you can just delete the `.xvc/` directory and all `.xvcignore` files in your repository to destroy it.
There is no command similar to `dvc diff` in Xvc.
There is no command similar to `dvc doctor` or `dvc version`. Version information should be visible in the help text. Unless compiled from source with feature flags, Xvc binaries don't have feature differences.
Currently, there are no commands corresponding to the `dvc exp` set of commands. This is on the roadmap for Xvc. Scope, implementation, and actual commands may differ.
`dvc fetch` is replaced by `xvc file bring --no-recheck`.
Instead of freezing "pipeline stages" as in `dvc freeze`, and unfreezing with `dvc unfreeze`, `xvc pipeline step update --changed [never|always|by_dependencies]` can be used to specify if/when to run a pipeline step.
Instead of `dvc gc` to "garbage-collect" files, you can use `xvc file remove` with various options.
There is no corresponding command for `dvc get-url` in Xvc. You can use `wget` or `curl` instead.
Currently there is no command to replace `dvc get`, `dvc import`, and `dvc import-url`. URL dependencies are supported in the pipeline with `xvc pipeline step dependency --url`.
Instead of `dvc install`-like hooks, Xvc issues Git commands itself if the `git.auto_commit` and `git.auto_stage` configuration options are set.
There is no corresponding command for `dvc list-url`.
`dvc list` is replaced by `xvc file list` for local paths. Its remote capabilities are not implemented but are on the roadmap.
Xvc doesn't mix files from different repositories in the same storage. There is an ID for each Xvc repo that's also used in remote storage paths.
Currently, there is no params/metrics tracking/diff similar to the `dvc params`, `dvc metrics` or `dvc plots` commands in Xvc.
`dvc move` is replaced by `xvc file move`.
`dvc push` is replaced by `xvc file send`.
`dvc pull` is replaced by `xvc file bring`.
There are no commands similar to `dvc queue` for experiments in Xvc. Experiment tracking will probably be handled differently.
The `dvc remote` set of commands is replaced by the `xvc storage` set of commands. You can use `xvc storage new` for adding new storages. Currently, there is no "default remote" facility in Xvc. Instead of `dvc remote modify`, you can use `xvc storage remove` and `xvc storage new`.
There is no single command to replace `dvc remove`. For files, you can use `xvc file delete`. For pipeline steps, you can use `xvc pipeline step remove`.
Instead of `dvc repro`, Xvc has `xvc pipeline run`. If you want to reproduce a pipeline, you can use `xvc pipeline run` again.
`xvc root` is for the same purpose as `dvc root`.
`dvc run` (which defines a stage in a DVC pipeline and immediately runs it) can be replaced by the `xvc pipeline` set of commands: `xvc pipeline new` for a new pipeline, `xvc pipeline step new` for a new step in the pipeline, `xvc pipeline step dependency` to specify dependencies of a step, `xvc pipeline step output` to specify outputs of a step, and `xvc pipeline run` to run this pipeline.
Instead of `dvc stage add`, we have `xvc pipeline step new`. For `dvc stage list`, we have `xvc pipeline step list`.
There is no need for `dvc protect` or `dvc unprotect` commands in Xvc. The "cache type" of Xvc is not a repository-wide option, and is called "recheck method". If you want to track a certain directory as symlink, and another as hardlink, you can do so with `xvc file recheck --as`. If you want identical files copied to one directory and linked in another, `xvc file copy` can help.
DVC needs `dvc update` for external dependencies in pipelines. Xvc checks their metadata like any other dependency before downloading and automatically invalidates the step if the URL/file has changed.
DVC leaves Git operations to the user, and automates them to a certain degree with Git hooks. Xvc adds Git commits to the repository after operations by default.
Extra Features of Xvc
Xvc can use multiple hashing functions, like BLAKE3, BLAKE2s, SHA2-256 and SHA3-256. More can be added upon request. The only requirement for hashes is having 32 bytes (256 bits) of output.
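The output-size requirement is easy to check for the algorithms that ship with Python. This is only an illustration of digest lengths: BLAKE3 is not in the Python standard library, so it is omitted, and nothing here reflects how Xvc itself calls these functions. A 256-bit digest is 32 bytes, which prints as 64 hexadecimal digits.

```python
import hashlib

data = b"xvc"
digests = {
    "BLAKE2s": hashlib.blake2s(data).digest(),
    "SHA2-256": hashlib.sha256(data).digest(),
    "SHA3-256": hashlib.sha3_256(data).digest(),
}

for name, value in digests.items():
    print(name, len(value), "bytes,", len(value.hex()), "hex digits")
```

Because all of them produce the same output width, one algorithm can be swapped for another without changing how digests are stored.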
In its pipelines, Xvc has more flexibility in defining dependencies. DVC supports files, directories and hyperparameters. Xvc additionally supports:
- globs,
- text file lines defined by line numbers,
- text file lines defined by regular expressions,
- URLs,
- SQLite queries.
Technical Differences
- DVC is written in Python. Xvc is written in Rust.
- DVC uses MD5 to check file content changes. Xvc uses BLAKE3 by default, and can be configured to use BLAKE2s, SHA2-256 and SHA3-256.
- DVC tracks file/directory changes in separate `.dvc` files. Xvc tracks them in `.json` files in `.xvc/store`. There is no 1-1 correspondence between these files and the directory structure.
- DVC uses Object-Oriented Programming in Python. Xvc tries to minimize function/data coupling and uses an Entity-Component System (`xvc-ecs`) in its core.
- DVC remotes are identical to their cache in structure, and multiple DVC repositories use the same remote by mixing files. This provides inter-repository deduplication. Xvc uses a separate directory for each repository. This means identical files in separate Xvc repositories are duplicated, and when you want to delete all files associated with a repository, you can do so without the risk of deleting files used in other repositories.
- DVC considers directories as file-equivalent entities to track with `.dvc` files pointing to `.json` files in the cache. Xvc doesn't track directories as identical to files. They are considered collections of files.
- DVC uses Dulwich for Git operations. Xvc executes the Git process directly, with its common command line options.
Benchmarking Xvc vs DVC
In this section, we'll write a few tests to see how Xvc and DVC perform in common tasks. This document is meant to be reproducible, so the differences in performance can be checked. I'll update it from time to time to track the differences, and I'll also add more tests.
This is mostly to satisfy my personal curiosity. I don't claim these are scientific experiments that describe the performance in all conditions.
We'll test the tools in the following scenarios:
- Checking in small files: We'll unzip 15,000 images from the Chinese-MNIST dataset and measure the time for dvc add and xvc file track.
- Checking out small files: We'll delete the files we track and recheck / checkout them using dvc checkout and xvc file recheck.
- Pushing/sending the small files we added to S3
- Pulling/bringing the small files we pushed from S3
- Checking in and out large files: We'll create 100 large files using xvc-test-helper and repeat the above tests.
- Running small pipelines: We'll create a pipeline with 10 steps to run simple commands.
- Running medium sized pipelines: We'll create a pipeline with 100 steps to run simple commands.
- Running large pipelines: We'll create a pipeline with 1000 steps to run simple commands.
Setup
This document uses the most recent versions of Xvc and DVC. DVC is installed via Homebrew.
$ dvc --version
3.30.3
$ xvc --version
xvc v0.6.4-alpha.0-300-g08c034a-modified
Init Repositories
Let's start by measuring the performance of initializing repositories.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ hyperfine -r 1 'xvc init'
Benchmark 1: xvc init
Time (abs ≡): 48.6 ms [User: 11.0 ms, System: 21.3 ms]
$ hyperfine -r 1 'dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"'
Benchmark 1: dvc init ; git add .dvc/ .dvcignore ; git commit -m "Init DVC"
Time (abs ≡): 425.3 ms [User: 205.7 ms, System: 86.3 ms]
$ git status -s
?? chinese_mnist.zip
Unzip the images
$ unzip -q chinese_mnist.zip
$ zsh -cl 'cp -r data/data xvc-data'
$ zsh -cl 'cp -r data/data dvc-data'
$ tree -d
.
├── data
│ └── data
├── dvc-data
└── xvc-data
5 directories
15K Small Files Performance
Xvc commits the changed metafiles automatically unless otherwise specified in the options. In the DVC command below, we also commit *.dvc files.
$ hyperfine -r 1 'xvc file track xvc-data/'
Benchmark 1: xvc file track xvc-data/
Time (abs ≡): 3.655 s [User: 0.931 s, System: 12.339 s]
$ hyperfine -r 1 --show-output 'dvc add dvc-data/ '
Benchmark 1: dvc add dvc-data/
To track the changes with git, run:
git add .gitignore dvc-data.dvc
To enable auto staging, run:
dvc config core.autostage true
Time (abs ≡): 13.027 s [User: 4.740 s, System: 6.765 s]
$ lsd -l
$ git status -s
M .gitignore
?? chinese_mnist.zip
?? data/
?? dvc-data.dvc
Checkout a directory with 15K files
$ rm -rf xvc-data
$ hyperfine -r 1 'xvc file recheck xvc-data/'
Benchmark 1: xvc file recheck xvc-data/
Time (abs ≡): 2.378 s [User: 0.438 s, System: 2.152 s]
$ rm -rf dvc-data/
$ ls
chinese_mnist.zip
data
dvc-data.dvc
xvc-data
$ hyperfine -r 1 --show-output 'dvc checkout dvc-data.dvc'
Benchmark 1: dvc checkout dvc-data.dvc
A dvc-data/
Time (abs ≡): 4.102 s [User: 1.399 s, System: 2.155 s]
Large File Performance
$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.669660 secs (628017680 bytes/sec)
$ hyperfine -r 1 'xvc file track xvc-large-file'
Benchmark 1: xvc file track xvc-large-file
Time (abs ≡): 1.499 s [User: 0.816 s, System: 0.805 s]
$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.446919 secs (724695716 bytes/sec)
$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc .gitignore ; git commit -m "Added dvc-large-file to DVC"
To track the changes with git, run:
git add dvc-large-file.dvc .gitignore
To enable auto staging, run:
dvc config core.autostage true
[main 72fd199] Added dvc-large-file to DVC
2 files changed, 6 insertions(+)
create mode 100644 dvc-large-file.dvc
Time (abs ≡): 2.153 s [User: 1.906 s, System: 0.203 s]
Commit/Carry-in Large Files
$ zsh -cl 'dd if=/dev/urandom of=xvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550065 secs (676472277 bytes/sec)
$ hyperfine -r 1 'xvc file carry-in xvc-large-file'
Benchmark 1: xvc file carry-in xvc-large-file
Time (abs ≡): 1.024 s [User: 0.629 s, System: 0.393 s]
$ zsh -cl 'dd if=/dev/urandom of=dvc-large-file bs=1M count=1000'
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 1.550363 secs (676342250 bytes/sec)
$ hyperfine -r 1 --show-output 'dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"'
Benchmark 1: dvc add dvc-large-file ; git add dvc-large-file.dvc ; git commit -m "Added dvc-large-file to DVC"
To track the changes with git, run:
git add dvc-large-file.dvc
To enable auto staging, run:
dvc config core.autostage true
[main c74d783] Added dvc-large-file to DVC
1 file changed, 1 insertion(+), 1 deletion(-)
Time (abs ≡): 2.098 s [User: 1.903 s, System: 0.189 s]
Pipeline with 10 Steps
Pipeline steps will depend on the following files.
$ xvc-test-helper create-directory-tree --directories 1 --files 10 --root pipeline-10
$ tree pipeline-10
pipeline-10
└── dir-0001
├── file-0001.bin
├── file-0002.bin
├── file-0003.bin
├── file-0004.bin
├── file-0005.bin
├── file-0006.bin
├── file-0007.bin
├── file-0008.bin
├── file-0009.bin
└── file-0010.bin
2 directories, 10 files
Let's create 10 DVC stages to depend on these files:
$ zsh -cl "for f in pipeline-10/dir-0001/* ; do dvc stage add -q -n ${f:r:t} -d ${f} 'sha1sum $f'; done"
$ dvc stage list
file-0001 Depends on pipeline-10/dir-0001/file-0001.bin
file-0002 Depends on pipeline-10/dir-0001/file-0002.bin
file-0003 Depends on pipeline-10/dir-0001/file-0003.bin
file-0004 Depends on pipeline-10/dir-0001/file-0004.bin
file-0005 Depends on pipeline-10/dir-0001/file-0005.bin
file-0006 Depends on pipeline-10/dir-0001/file-0006.bin
file-0007 Depends on pipeline-10/dir-0001/file-0007.bin
file-0008 Depends on pipeline-10/dir-0001/file-0008.bin
file-0009 Depends on pipeline-10/dir-0001/file-0009.bin
file-0010 Depends on pipeline-10/dir-0001/file-0010.bin
Run the DVC pipeline
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 766.8 ms [User: 482.4 ms, System: 218.7 ms]
Running without changing the dependencies
$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
Time (mean ± σ): 455.8 ms ± 22.6 ms [User: 342.3 ms, System: 107.4 ms]
Range (min … max): 431.0 ms … 492.3 ms 5 runs
$ zsh -cl "for f in pipeline-10/dir-0001/* ; do xvc pipeline step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline step dependency -s ${f:r:t} --file ${f} ; done"
$ hyperfine -r 1 "xvc pipeline run"
Benchmark 1: xvc pipeline run
Time (abs ≡): 229.8 ms [User: 53.9 ms, System: 227.3 ms]
$ hyperfine -M 5 "xvc pipeline run"
Benchmark 1: xvc pipeline run
Time (mean ± σ): 176.8 ms ± 4.0 ms [User: 34.6 ms, System: 144.1 ms]
Range (min … max): 173.0 ms … 183.0 ms 5 runs
Pipeline with 100 Steps
Pipeline steps will depend on the following files.
$ xvc-test-helper create-directory-tree --directories 1 --files 100 --root pipeline-100
$ tree -d pipeline-100
pipeline-100
└── dir-0001
2 directories
$ rm -f dvc.yaml
$ zsh -cl "for f in pipeline-100/dir-0001/* ; do dvc stage add -q -n s-${RANDOM} -d ${f} 'sha1sum $f'; done"
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 10.383 s [User: 8.813 s, System: 1.072 s]
$ hyperfine -M 5 "dvc repro"
Benchmark 1: dvc repro
Time (mean ± σ): 637.3 ms ± 9.8 ms [User: 467.4 ms, System: 161.1 ms]
Range (min … max): 630.2 ms … 654.3 ms 5 runs
Let's create 100 Xvc steps to depend on the same files.
$ xvc pipeline new --pipeline-name p100
$ zsh -cl "for f in pipeline-100/dir-0001/* ; do xvc pipeline -p p100 step new -s ${f:r:t} --command 'sha1sum $f' ; xvc pipeline -p p100 step dependency -s ${f:r:t} --file ${f} ; done"
$ hyperfine -r 1 --show-output "xvc pipeline -p p100 run"
Benchmark 1: xvc pipeline -p p100 run
Time (abs ≡): 201.9 ms [User: 39.6 ms, System: 168.4 ms]
$ hyperfine -M 5 "xvc pipeline -p p100 run"
Benchmark 1: xvc pipeline -p p100 run
Time (mean ± σ): 198.7 ms ± 3.1 ms [User: 39.9 ms, System: 163.9 ms]
Range (min … max): 196.0 ms … 203.8 ms 5 runs
Note that the first runs of the two tools differ drastically. DVC runs all stages sequentially in around 10.4 seconds, while Xvc runs them in parallel in 0.2 seconds. Let's also measure the average run time of a sha1sum command to see how much of this time is spent in the actual commands.
$ hyperfine 'sha1sum pipeline-100/dir-0001/file-0001.bin'
Benchmark 1: sha1sum pipeline-100/dir-0001/file-0001.bin
Time (mean ± σ): 1.2 ms ± 0.2 ms [User: 0.4 ms, System: 0.5 ms]
Range (min … max): 0.9 ms … 2.7 ms 535 runs
Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
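To put these numbers side by side, the sketch below redoes the arithmetic with the figures reported above (10.383 s and 201.9 ms for the first runs, 1.2 ms mean for one sha1sum). It is only a back-of-the-envelope calculation on single runs, not a new measurement, and the sequential command time is an upper bound for a tool that runs steps in parallel.

```python
# Timings copied from the hyperfine outputs above, in seconds.
DVC_FIRST_RUN = 10.383
XVC_FIRST_RUN = 0.2019
SHA1SUM_MEAN = 0.0012
STEPS = 100


def summarize(total: float, steps: int, per_command: float) -> dict:
    """Split a total run time into time spent in commands and the rest."""
    # Time the commands would need if they ran one after another.
    command_time = steps * per_command
    return {
        "command_time": command_time,
        "overhead": total - command_time,
        "per_step": total / steps,
    }


def first_run_ratio() -> float:
    """How many times longer the DVC first run took than the Xvc first run."""
    return DVC_FIRST_RUN / XVC_FIRST_RUN


if __name__ == "__main__":
    for tool, total in (("dvc", DVC_FIRST_RUN), ("xvc", XVC_FIRST_RUN)):
        summary = summarize(total, STEPS, SHA1SUM_MEAN)
        print(tool, {key: round(value, 4) for key, value in summary.items()})
    print(f"first-run ratio: {first_run_ratio():.1f}x")
```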
Pipeline with 1000 Steps
In this case we'll just measure the run times of 1000 ls commands.
$ rm -f dvc.yaml
$ zsh -cl "for i in {1..1000}; do dvc stage add -q -n s-${i} 'ls'; done"
$ zsh -cl 'dvc stage list | wc -l'
1000
$ hyperfine -r 1 "dvc repro"
Benchmark 1: dvc repro
Time (abs ≡): 469.534 s [User: 449.463 s, System: 17.257 s]
$ hyperfine -M 5 "dvc repro"
? interrupted
Benchmark 1: dvc repro
$ xvc pipeline new --pipeline-name p1000
$ zsh -cl "for i in {1..1000} ; do xvc --skip-git pipeline -p p1000 step new -s s-${i} --command 'ls' ; done"
$ zsh -cl 'xvc pipeline step list --names-only | wc -l'
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
10
$ hyperfine -r 1 --show-output "xvc pipeline -p p1000 run"
Benchmark 1: xvc pipeline -p p1000 run
Time (abs ≡): 460.0 ms [User: 78.7 ms, System: 376.8 ms]
$ hyperfine -M 5 "xvc pipeline -p p1000 run"
Benchmark 1: xvc pipeline -p p1000 run
Time (mean ± σ): 404.5 ms ± 10.6 ms [User: 79.0 ms, System: 366.7 ms]
Range (min … max): 397.4 ms … 423.2 ms 5 runs
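One way to read the three pipeline sizes together is to divide each first-run time by the number of steps. The sketch below does this with the single-run figures reported in this section. Note the caveats: each value comes from one run, and the 10- and 100-step pipelines run sha1sum while the 1000-step pipeline runs ls, so the rows are not strictly comparable.

```python
# First-run wall-clock times copied from the hyperfine outputs above (seconds).
FIRST_RUNS = {
    10: {"dvc": 0.7668, "xvc": 0.2298},
    100: {"dvc": 10.383, "xvc": 0.2019},
    1000: {"dvc": 469.534, "xvc": 0.4600},
}


def per_step_ms(total_seconds: float, steps: int) -> float:
    """Average wall-clock time per step in milliseconds."""
    return total_seconds * 1000.0 / steps


def scaling_table(runs: dict) -> list:
    """Return (steps, dvc ms/step, xvc ms/step) rows ordered by pipeline size."""
    rows = []
    for steps in sorted(runs):
        rows.append(
            (
                steps,
                per_step_ms(runs[steps]["dvc"], steps),
                per_step_ms(runs[steps]["xvc"], steps),
            )
        )
    return rows


if __name__ == "__main__":
    print(f"{'steps':>6} {'dvc ms/step':>12} {'xvc ms/step':>12}")
    for steps, dvc_ms, xvc_ms in scaling_table(FIRST_RUNS):
        print(f"{steps:>6} {dvc_ms:>12.2f} {xvc_ms:>12.2f}")
```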
How-To Guides
How to Compile Xvc
Why would you compile?
- You want to use Xvc on a platform for which we don't distribute a binary.
- You want a smaller binary size by removing features that you don't use.
- You like your software compiled.
- It's easier for you to use cargo than other means of installation.
- Fix a bug for yourself.
- Contribute!
Install Rust
You must have Rust installed on your system.
If you have a sensible terminal on your system:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Otherwise, refer to the other installation methods page.
Clone the repository
Clone the source from Emre's GitHub repository.
$ git clone https://github.com/iesahin/xvc -b latest
The latest tag refers to the latest stable release. If you're willing to fight with compilation errors, you can also use the main branch directly.
Compile without default features
Xvc with Git Branches
When you're working with multiple branches in Git, you may ask Xvc to check out a branch and commit to another branch.
These operations are performed at the beginning and at the end of Xvc operations.
You can use the --from-ref and --to-branch options to check out a Git reference before an Xvc operation, and commit the results to a certain Git branch.
Checkout and commit operations sandwich Xvc operations.
graph LR
    checkout["git checkout $REF"] --> xvc
    xvc["xvc operation"] --> stash["git stash --staged"]
    stash --> branch["git checkout --branch $TO_BRANCH"]
    branch --> commit["git add .xvc && git commit"]
If --from-ref is not given, the initial git checkout is not performed.
Xvc operates in the current branch.
This is the default behavior.
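The ordering described above can be sketched as a small function that lists the stages around an Xvc operation. This is a hypothetical model, not Xvc's code: the stage strings are the labels from the diagram rather than exact Git commands, the function name git_sandwich is made up for illustration, and it assumes that without --to-branch the commit simply lands on the current branch.

```python
from typing import List, Optional


def git_sandwich(
    xvc_operation: str,
    from_ref: Optional[str] = None,
    to_branch: Optional[str] = None,
) -> List[str]:
    """List the stages around an Xvc operation, using the diagram's labels."""
    stages = []
    if from_ref is not None:
        # Only performed when --from-ref is given.
        stages.append(f"git checkout {from_ref}")
    stages.append(xvc_operation)
    if to_branch is not None:
        # Labels copied from the diagram; they describe stages, not exact commands.
        stages.append("git stash --staged")
        stages.append(f"git checkout --branch {to_branch}")
    # Assumption: without --to-branch the commit goes to the current branch.
    stages.append("git add .xvc && git commit")
    return stages


if __name__ == "__main__":
    for stage in git_sandwich("xvc pipeline run", "data-file", "uppercase"):
        print(stage)
```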
$ git init --initial-branch=main
...
$ xvc init
$ ls
data.txt
$ xvc --to-branch data-file file track data.txt
Switched to a new branch 'data-file'
$ git branch
* data-file
main
$ git status -s
$ xvc file list data.txt
FC 19 2023-06-08 11:47:18 c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
If you return to the main branch, you'll see the file is tracked by neither Git nor Xvc.
$ git checkout main
...
$ xvc file list data.txt
FX 19 2023-06-08 11:47:18 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 0
$ git status -s
?? data.txt
Now, we'll add a step to the default pipeline to get an uppercase version of the data. We want this to work only in the data-file branch.
$ xvc --from-ref data-file pipeline step new --step-name to-uppercase --command 'cat data.txt | tr a-z A-Z > uppercase.txt'
Switched to branch 'data-file'
$ xvc pipeline step dependency --step-name to-uppercase --file data.txt
$ xvc pipeline step output --step-name to-uppercase --output-file uppercase.txt
Note that the xvc pipeline step dependency and xvc pipeline step output commands don't need the --from-ref and --to-branch options, as they already run in the data-file branch.
Now, we want to have this new version of data available only in the uppercase branch.
$ xvc --from-ref data-file --to-branch uppercase pipeline run
Already on 'data-file'
[DONE] to-uppercase (cat data.txt | tr a-z A-Z > uppercase.txt)
Switched to a new branch 'uppercase'
$ git branch
data-file
main
* uppercase
You can use this for experimentation.
Whenever you want to run a pipeline and keep the results in another Git branch, you can use --to-branch.
$ xvc --from-ref data-file --to-branch another-uppercase pipeline run
$ git branch
* another-uppercase
uppercase
data-file
main
The pipeline always runs, because uppercase.txt is always missing in the data-file branch.
It's stored only in the resulting branch you give by --to-branch.
Turning off Automated Git Operations
By default, Xvc automates all common Git operations. When you run an Xvc operation that affects the files under the .xvc directory, the changes are committed to the repository automatically.
Git automation runs in Git repositories.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ xvc init
We'll show these examples in the following directory tree.
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20231012
$ tree
.
└── dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
2 directories, 3 files
When you begin to track a file in the repository, Xvc adds the file to the .gitignore in the directory where the file is found.
$ xvc file track dir-0001/file-0001.bin
$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
/file-0001.bin
Xvc also adds a commit for all the changes caused by the command.
$ git log -n 1
commit [..]
Author: [..]
Date: [..]
Xvc auto-commit after '[..]xvc file track dir-0001/file-0001.bin'
The commit message includes the command you ran, so you can find the exact change in history.
If you don't track files with Xvc, they are not added to .gitignore and you can see them with git status.
$ git status -s
?? dir-0001/file-0002.bin
?? dir-0001/file-0003.bin
If you want to skip these automated Git operations, you can add the --skip-git flag to commands.
$ xvc --skip-git file track dir-0001/file-0002.bin
$ git status -s
M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? dir-0001/file-0003.bin
Note that the --skip-git flag doesn't prevent files from being added to .gitignore files.
$ zsh -cl 'cat dir-0001/.gitignore'
### Following 1 lines are added by xvc on [..]
/file-0001.bin
### Following 1 lines are added by xvc on [..]
/file-0002.bin
You can use the usual Git workflow to add and commit the files.
$ git add .xvc dir-0001/.gitignore
$ git commit -m "Began to track dir-0001/file-0002.bin with Xvc"
[main [..]] Began to track dir-0001/file-0002.bin with Xvc
7 files changed, 8 insertions(+)
create mode 100644 .xvc/ec/[..]
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
create mode 100644 .xvc/store/[..].json
If you never want Xvc to handle commits, you can set the git.use_git option in the .xvc/config file to false or set XVC_git.use_git=false in the environment.
$ env XVC_git.use_git=false xvc file track dir-0001/file-0003.bin
$ git status -s
M dir-0001/.gitignore
?? .xvc/ec/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
?? .xvc/store/[..]
How to create a data pipeline with Xvc
A data pipeline starts from data and ends with models. In between, there are various data transformations and model training. We try to make all pieces reproducible and Xvc helps with this goal.
In this document, we'll create the following pipeline for a digit recognition system. Our purpose is to show how Xvc helps in versioning data, so this document doesn't try to achieve a high classification performance.
graph LR
    A[Data Gathering] --> B[Splitting Test and Train Sets]
    B --> C[Preprocessing Images into Numpy Arrays]
    C --> D[Training Model]
    D --> E[Sharing Data and Models]
This document can be more verbose than usual, because all commands in this document are run on a clean directory during tests to check outputs. Some of the idiosyncrasies, e.g., running certain commands with zsh -c, are due to this reason.
Although you can do without it, most of the time Xvc runs in a Git repository. This allows you to version control the data and the code together.
$ git init
Initialized empty Git repository in [CWD]/.git/
$ xvc init
In this HOWTO, we use the Chinese MNIST dataset to create an image classification pipeline. We already downloaded it from Kaggle.
$ ls -l
total 21112
-rw-r--r-- 1 iex staff 10792680 Nov 17 19:46 chinese_mnist.zip
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52 train.py
Let's start by tracking the data file with Xvc.
$ xvc file track chinese_mnist.zip --as symlink
The default recheck (checkout) method is copy, which means the file is duplicated in the workspace as a writable file. We don't need to write over this data file; we'll only read from it, so we set the recheck method to symlink.
$ ls -l
total 32
lrwxr-xr-x 1 iex staff 195 Dec 2 12:10 chinese_mnist.zip -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/0.zip
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52 train.py
The long directory name is the BLAKE3 hash of the data file.
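The symlink target in the listing suggests how the digest maps to a location under .xvc. The sketch below reproduces that layout by splitting the 64 hex digits into groups of 3, 3 and 58 characters. The layout is inferred only from the listing above, not from Xvc's source; the function name cache_path is hypothetical, and reading b3 as the BLAKE3 prefix is an assumption.

```python
def cache_path(digest_hex: str, extension: str, algorithm_dir: str = "b3") -> str:
    """Build a cache-relative path shaped like the symlink target shown above."""
    if len(digest_hex) != 64:
        raise ValueError("expected a 256-bit digest written as 64 hex digits")
    parts = [
        algorithm_dir,        # assumed to name the hash algorithm (BLAKE3)
        digest_hex[:3],       # first 3 hex digits
        digest_hex[3:6],      # next 3 hex digits
        digest_hex[6:],       # remaining 58 hex digits
        f"0.{extension}",     # file name as it appears in the listing
    ]
    return "/".join(parts)


# Digest reassembled from the directory names in the listing above.
LISTED_DIGEST = "b242c9422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d"

if __name__ == "__main__":
    print(".xvc/" + cache_path(LISTED_DIGEST, "zip"))
```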
As we'll work with the file contents, let's unzip the data file.
$ unzip -q chinese_mnist.zip
$ ls -l
total 32
lrwxr-xr-x 1 iex staff 195 Dec 2 12:10 chinese_mnist.zip -> [CWD]/.xvc/b3/b24/2c9/422f91b804ea3008bc0bc025e97bf50c1d902ae7a0f13588b84f59023d/0.zip
drwxr-xr-x 4 iex staff 128 Nov 17 19:45 data
-rw-r--r-- 1 iex staff 1124 Nov 28 14:27 image_to_numpy_array.py
-rw-r--r-- 1 iex staff 40 Dec 1 11:59 requirements.txt
-rw-r--r-- 1 iex staff 4436 Dec 1 22:52 train.py
Now we have the data directory with the following structure:
$ tree -d data
data
└── data
2 directories
Let's track the data directory as well with Xvc.
$ xvc file track data --as symlink
The reason we're tracking the data directory separately is that we'll use different subsets as training, validation, and test data.
Let's list the track status of files first.
$ xvc file list data/data/input_9_9_*
SS [..] 3a714d65 data/data/input_9_9_9.jpg
SS [..] 9ffccc4d data/data/input_9_9_8.jpg
SS [..] 5d6312a4 data/data/input_9_9_7.jpg
SS [..] 7a0ddb0e data/data/input_9_9_6.jpg
SS [..] 2047d7f3 data/data/input_9_9_5.jpg
SS [..] 10fcf309 data/data/input_9_9_4.jpg
SS [..] 0bdcd918 data/data/input_9_9_3.jpg
SS [..] aebcbc03 data/data/input_9_9_2.jpg
SS [..] 38abd173 data/data/input_9_9_15.jpg
SS [..] 7c6a9003 data/data/input_9_9_14.jpg
SS [..] a9f04ad9 data/data/input_9_9_13.jpg
SS [..] 2d372f95 data/data/input_9_9_12.jpg
SS [..] 8fe799b4 data/data/input_9_9_11.jpg
SS [..] ee35e5d5 data/data/input_9_9_10.jpg
SS [..] 7576894f data/data/input_9_9_1.jpg
Total #: 15 Workspace Size: 2925 Cached Size: 8710
The xvc file list command shows the tracking status. The initial two characters show the status: SS means the file is tracked as a symlink and is available in the workspace as a symlink. The next column shows the file size, then the last modified date, then the BLAKE3 hash of the file, and finally the file name. The empty column contains the actual hash of the file if the file is available in the workspace. Here it's empty because the workspace file is a link to the file in cache.
The summary line shows the total size of the files and the size they occupy in the workspace.
Splitting Train, Validation, and Test Sets
The first step of the pipeline is to create subsets of the data.
The data set contains 15 classes. It has 10 samples for each of these classes from 100 different people. As we'll train a Chinese digit recognizer, we'll first assign 60 volunteers (1-59 and 100) to training, 20 volunteers (60-79) to validation, and 20 volunteers (80-99) to testing. This ensures that the model is not evaluated on the handwriting of a person it was trained on.
$ xvc file copy --name-only data/data/input_?_* data/train/
$ xvc file copy --name-only data/data/input_[12345]?_* data/train/
$ xvc file copy --name-only data/data/input_100_* data/train/
$ xvc file copy --name-only data/data/input_[67]?_* data/validate/
$ xvc file copy --name-only data/data/input_[89]?_* data/test/
$ tree -d data/
data/
├── data
├── test
├── train
└── validate
5 directories
If you look at the contents of these directories, you'll see that they are symbolic links to the same files we started to track.
Let's check the number of images in each set.
$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
9000
$ zsh -c 'ls -1 data/validate/*.jpg | wc -l'
3000
$ zsh -c 'ls -1 data/test/*.jpg | wc -l'
3000
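The counts above can be cross-checked without the dataset by applying the same glob patterns to generated file names. This sketch assumes the files are named input_<volunteer>_<sample>_<class>.jpg with 100 volunteers, 10 samples and 15 classes, which is consistent with the listing and counts shown earlier but is not verified against the archive itself. Python's fnmatch stands in for the shell globs.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Glob patterns taken from the xvc file copy commands above.
SPLIT_PATTERNS = {
    "train": ["input_?_*", "input_[12345]?_*", "input_100_*"],
    "validate": ["input_[67]?_*"],
    "test": ["input_[89]?_*"],
}


def split_for(filename: str) -> str:
    """Return the single split whose patterns match the file name."""
    matches = [
        split
        for split, patterns in SPLIT_PATTERNS.items()
        if any(fnmatchcase(filename, pattern) for pattern in patterns)
    ]
    if len(matches) != 1:
        raise ValueError(f"{filename} matches {len(matches)} splits")
    return matches[0]


def count_splits(volunteers: int = 100, samples: int = 10, classes: int = 15) -> Counter:
    """Count generated file names per split (naming scheme is an assumption)."""
    counts = Counter()
    for volunteer in range(1, volunteers + 1):
        for sample in range(1, samples + 1):
            for label in range(1, classes + 1):
                counts[split_for(f"input_{volunteer}_{sample}_{label}.jpg")] += 1
    return counts


if __name__ == "__main__":
    print(dict(count_splits()))
```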
The first step in the pipeline will be rechecking (checking out) these subsets.
$ xvc pipeline step new -s recheck-data --command 'xvc file recheck data/train/ data/validate/ data/test/'
xvc file recheck is used to reinstate files from the Xvc cache.
Let's test the pipeline by first deleting the files we manually created.
$ rm -rf data/train data/validate data/test
We run the steps we created.
$ xvc pipeline run
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
If we check the contents of the directories, we'll see that they are back.
$ zsh -c 'ls -1 data/train/*.jpg | wc -l'
9000
Preprocessing Images into Numpy Arrays
graph LR
    A[Data Gathering ✅] --> B[Splitting Test and Train Sets ✅]
    B --> C[Preprocessing Images into Numpy Arrays]
    C --> D[Training Model]
    D --> E[Sharing Data and Models]
The Python script that trains the model works with NumPy arrays, so we'll convert each of these image directories into two NumPy arrays. One of the arrays will keep $n$ 64x64 images and the other will keep $n$ labels for these images.
$ xvc pipeline step new --step-name create-train-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/train/'
$ xvc pipeline step new --step-name create-test-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/test/'
$ xvc pipeline step new --step-name create-validate-array --command '.venv/bin/python3 image_to_numpy_array.py --dir data/validate/'
These commands will run when the image files in those directories change. Xvc can keep track of file groups and invalidate a step when the content of any of these files changes. Moreover, it's possible to track which individual files have changed when there are many files. We don't need this feature of tracking individual items in globs, so we'll use a glob dependency.
$ xvc pipeline step dependency --step-name create-train-array --glob 'data/train/*.jpg'
$ xvc pipeline step dependency --step-name create-test-array --glob 'data/test/*.jpg'
$ xvc pipeline step dependency --step-name create-validate-array --glob 'data/validate/*.jpg'
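To illustrate what invalidating a step by content (rather than by timestamp) means for a glob, here is a minimal sketch that folds the matching paths and their content digests into one value. It is a hypothetical model of the idea, not Xvc's implementation: glob_digest is a made-up name, and SHA-256 from the standard library stands in for BLAKE3.

```python
import hashlib
from pathlib import Path


def glob_digest(root: Path, pattern: str) -> str:
    """Combine the relative paths and contents of matching files into one digest."""
    collector = hashlib.sha256()
    # Sort so the result does not depend on directory listing order.
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            collector.update(path.relative_to(root).as_posix().encode("utf-8"))
            collector.update(hashlib.sha256(path.read_bytes()).digest())
    return collector.hexdigest()


def is_invalidated(root: Path, pattern: str, recorded_digest: str) -> bool:
    """A step would need to rerun when the current digest differs from the record."""
    return glob_digest(root, pattern) != recorded_digest
```

With this model, touching a file without changing its bytes leaves the digest unchanged, while editing, adding or removing a matching file changes it.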
Now we have three more steps that depend on changed files. The script depends on OpenCV to read images. Python best practices recommend creating a separate virtual environment for each project. We'll also make sure that the venv is created and the requirements are installed before running the script.
Create a step to initialize the virtual environment. It should run only once per deployment.
$ xvc pipeline step new --step-name init-venv --command 'python3 -m venv .venv'
$ xvc pipeline step dependency --step-name init-venv --generic 'echo "$(hostname)/$(pwd)"'
We used a --generic dependency that runs a command and checks its output to see whether the step needs to be run again. We only want to run init-venv once per deployment, so checking the output of hostname and pwd is better than checking the existence of a file. File dependencies must be available before running the pipeline to record their metadata. There is no such restriction for generic dependencies.
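The idea behind a generic dependency can be sketched as: run a command, digest its output, and compare with the digest recorded on the previous run. The following is a hypothetical illustration under those assumptions, not Xvc's implementation; output_digest and needs_rerun are made-up names, and the Python interpreter is used as the command so the example does not depend on a particular shell.

```python
import hashlib
import subprocess
import sys
from typing import List


def output_digest(argv: List[str]) -> str:
    """Run a command and return a digest of what it printed to stdout."""
    completed = subprocess.run(argv, capture_output=True, check=True)
    return hashlib.sha256(completed.stdout).hexdigest()


def needs_rerun(argv: List[str], recorded_digest: str) -> bool:
    """The step is invalidated when the command output no longer matches the record."""
    return output_digest(argv) != recorded_digest


if __name__ == "__main__":
    # Stand-in for a command like the hostname/pwd one above.
    command = [sys.executable, "-c", "import os; print(os.getcwd())"]
    recorded = output_digest(command)
    print("rerun needed:", needs_rerun(command, recorded))
```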
Then, another step that depends on init-venv and requirements.txt will install the dependencies.
$ xvc pipeline step new --step-name install-requirements --command '.venv/bin/python3 -m pip install -r requirements.txt'
$ xvc pipeline step dependency --step-name install-requirements --step init-venv
$ xvc pipeline step dependency --step-name install-requirements --file requirements.txt
Note that, unlike other tools, you can specify direct dependencies between steps in Xvc. When a pipeline step must wait for another step to finish successfully, a dependency between these two can be defined.
The above create-*-array steps will depend on install-requirements to ensure that requirements are installed when the scripts are run.
$ xvc pipeline step dependency --step-name create-train-array --step install-requirements
$ xvc pipeline step dependency --step-name create-validate-array --step install-requirements
$ xvc pipeline step dependency --step-name create-test-array --step install-requirements
Now, as the pipeline grows, it may be nice to see a graph of what we have done so far.
$ xvc pipeline dag --format mermaid
flowchart TD
n0["recheck-data"]
n1["create-train-array"]
n2["data/train/*.jpg"] --> n1
n3["install-requirements"] --> n1
n4["create-test-array"]
n5["data/test/*.jpg"] --> n4
n3["install-requirements"] --> n4
n6["create-validate-array"]
n7["data/validate/*.jpg"] --> n6
n3["install-requirements"] --> n6
n8["init-venv"]
n9["echo "$(hostname)/$(pwd)""] --> n8
n3["install-requirements"]
n8["init-venv"] --> n3
n10["requirements.txt"] --> n3
The dag command can also produce Graphviz DOT output, which may be more suitable for larger graphs. We'll use DOT to create images in later sections.
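The step-to-step edges defined so far already determine which steps can start together and which must wait. As a rough illustration, the sketch below groups the steps into waves with Python's graphlib (Python 3.9+). The dependency table is transcribed from the commands above; the scheduling itself is a generic topological sort, not a description of Xvc's internal scheduler.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on (from the --step dependencies above).
STEP_DEPENDENCIES = {
    "recheck-data": set(),
    "init-venv": set(),
    "install-requirements": {"init-venv"},
    "create-train-array": {"install-requirements"},
    "create-test-array": {"install-requirements"},
    "create-validate-array": {"install-requirements"},
}


def execution_waves(dependencies: dict) -> list:
    """Group steps into waves; every step in a wave has all its dependencies done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(STEP_DEPENDENCIES), start=1):
        print(number, ", ".join(wave))
```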
Let's run the pipeline at this point to test.
$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/mod.rs::343] Pipeline Graph:
digraph {
0 [ label = "(30009, 11376621678660215310)" ]
1 [ label = "(30012, 12907533602545881359)" ]
2 [ label = "(30010, 8484021102039729264)" ]
3 [ label = "(30011, 9338166212381570306)" ]
4 [ label = "(30016, 17450406389616117859)" ]
5 [ label = "(30018, 2681008057348839262)" ]
1 -> 5 [ label = "Step" ]
2 -> 5 [ label = "Step" ]
3 -> 5 [ label = "Step" ]
5 -> 4 [ label = "Step" ]
}
[INFO] No dependency steps for step recheck-data
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] No dependency steps for step init-venv
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-test-array
[INFO] Waiting for dependency steps for step create-train-array
[INFO] [init-venv] Dependencies has changed
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[DONE] init-venv (python3 -m venv .venv)
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] Dependencies has changed
[OUT] [install-requirements] Collecting opencv-python (from -r requirements.txt (line 1))
Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_11_0_arm64.whl.metadata (19 kB)
Collecting torch (from -r requirements.txt (line 2))
Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl.metadata (25 kB)
Collecting pyyaml (from -r requirements.txt (line 3))
Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting scikit-learn (from -r requirements.txt (line 4))
Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting numpy>=1.21.2 (from opencv-python->-r requirements.txt (line 1))
Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)
Collecting filelock (from torch->-r requirements.txt (line 2))
Using cached filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions (from torch->-r requirements.txt (line 2))
Using cached typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch->-r requirements.txt (line 2))
Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch->-r requirements.txt (line 2))
Using cached networkx-3.2.1-py3-none-any.whl.metadata (5.2 kB)
Collecting jinja2 (from torch->-r requirements.txt (line 2))
Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting fsspec (from torch->-r requirements.txt (line 2))
Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting scipy>=1.5.0 (from scikit-learn->-r requirements.txt (line 4))
Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl.metadata (165 kB)
Collecting joblib>=1.1.1 (from scikit-learn->-r requirements.txt (line 4))
Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn->-r requirements.txt (line 4))
Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch->-r requirements.txt (line 2))
Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch->-r requirements.txt (line 2))
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Using cached opencv_python-4.8.1.78-cp37-abi3-macosx_11_0_arm64.whl (33.1 MB)
Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl (59.6 MB)
Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl (167 kB)
Using cached scikit_learn-1.3.2-cp311-cp311-macosx_12_0_arm64.whl (9.4 MB)
Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Using cached numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Using cached scipy-1.11.4-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Using cached filelock-3.13.1-py3-none-any.whl (11 kB)
Using cached fsspec-2023.10.0-py3-none-any.whl (166 kB)
Using cached networkx-3.2.1-py3-none-any.whl (1.6 MB)
Using cached typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Using cached MarkupSafe-2.1.3-cp311-cp311-macosx_10_9_universal2.whl (17 kB)
Installing collected packages: mpmath, typing-extensions, threadpoolctl, sympy, pyyaml, numpy, networkx, MarkupSafe, joblib, fsspec, filelock, scipy, opencv-python, jinja2, torch, scikit-learn
Successfully installed MarkupSafe-2.1.3 filelock-3.13.1 fsspec-2023.10.0 jinja2-3.1.2 joblib-1.3.2 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.2 opencv-python-4.8.1.78 pyyaml-6.0.1 scikit-learn-1.3.2 scipy-1.11.4 sympy-1.12 threadpoolctl-3.2.0 torch-2.1.1 typing-extensions-4.8.0
[DONE] install-requirements (.venv/bin/python3 -m pip install -r requirements.txt)
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] [create-validate-array] Dependencies has changed
[INFO] [create-train-array] Dependencies has changed
[INFO] [create-test-array] Dependencies has changed
[DONE] create-validate-array (.venv/bin/python3 image_to_numpy_array.py --dir data/validate/)
[DONE] create-test-array (.venv/bin/python3 image_to_numpy_array.py --dir data/test/)
[DONE] create-train-array (.venv/bin/python3 image_to_numpy_array.py --dir data/train/)
Now, when we take a look at the data directories, we find images.npy and classes.npy files.
$ zsh -cl 'ls -l data/train/*.npy'
-rw-r--r-- 1 iex staff 72128 Dec 2 12:11 data/train/classes.npy
-rw-r--r-- 1 iex staff 110592128 Dec 2 12:11 data/train/images.npy
$ zsh -cl 'ls -l data/test/*.npy'
-rw-r--r-- 1 iex staff 24128 Dec 2 12:11 data/test/classes.npy
-rw-r--r-- 1 iex staff 36864128 Dec 2 12:11 data/test/images.npy
$ zsh -cl 'ls -l data/validate/*.npy'
-rw-r--r-- 1 iex staff 24128 Dec 2 12:11 data/validate/classes.npy
-rw-r--r-- 1 iex staff 36864128 Dec 2 12:11 data/validate/images.npy
Train a model
Now that we have built the NumPy arrays, we can train a model. We'll use a simple convolutional neural network as a showcase. This is by no means a state-of-the-art solution, so the results will be less than perfect.
graph LR A[Data Gathering ✅] --> B[Splitting Test and Train Sets ✅] B --> C[Preprocessing Images into Numpy Arrays ✅] C --> D[Training Model] D --> E[Sharing Data and Models]
The script receives the training, validation and testing directories, loads the data from the NumPy arrays we just produced, loads hyperparameters from a file called params.yaml, trains the model, tests it and writes the results and the model to files. It's a fairly involved script produced with the assistance of GPT-4.
We first define the step to run the command:
$ xvc pipeline step new --step-name train-model --command '.venv/bin/python3 train.py --train_dir data/train/ --val_dir data/validate --test_dir data/test'
The step will depend on the array generation steps through the files they produce. In order to define a dependency between the train-model and create-train-array steps, we must tell Xvc that create-train-array outputs a file called images.npy. We can do this with the --output-file option of the step output command.
$ xvc pipeline step output --step-name create-train-array --output-file data/train/images.npy
$ xvc pipeline step output --step-name create-train-array --output-file data/train/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/train/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/train/classes.npy
Note that this operation is different from creating a direct dependency between steps. There may be multiple steps creating the same outputs and there may be multiple steps depending on the same files. Choosing between direct (--step) dependencies and indirect (--file) dependencies is a matter of taste and use.
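As a rough illustration of how a file can link two steps, the following Python sketch derives step-to-step edges by matching declared outputs against declared file dependencies. It is not Xvc's implementation; the dictionaries and the function name implicit_step_dependencies are hypothetical and only mirror the steps defined in this how-to.

```python
# Hypothetical in-memory description of some of the steps defined above.
outputs = {
    "create-train-array": ["data/train/images.npy", "data/train/classes.npy"],
    "create-test-array": ["data/test/images.npy", "data/test/classes.npy"],
}
file_deps = {
    "train-model": ["data/train/images.npy", "data/test/classes.npy"],
}


def implicit_step_dependencies(outputs, file_deps):
    """Return (dependent, producer, path) triples linked through a shared file."""
    # Map every output path to the steps that declare it as an output.
    producers = {}
    for step, paths in outputs.items():
        for path in paths:
            producers.setdefault(path, []).append(step)

    edges = []
    for step, paths in file_deps.items():
        for path in paths:
            # Several steps may produce the same path, so keep all of them.
            for producer in producers.get(path, []):
                edges.append((step, producer, path))
    return edges


edges = implicit_step_dependencies(outputs, file_deps)
for dependent, producer, path in edges:
    print(f"{dependent} -> {producer} (via {path})")
```

Because the link goes through the path, replacing the step that produces a file doesn't require touching the steps that consume it.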
We'll create these dependencies for other files as well.
$ xvc pipeline step output --step-name create-test-array --output-file data/test/images.npy
$ xvc pipeline step output --step-name create-test-array --output-file data/test/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/test/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/test/classes.npy
$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/images.npy
$ xvc pipeline step output --step-name create-validate-array --output-file data/validate/classes.npy
$ xvc pipeline step dependency --step-name train-model --file data/validate/images.npy
$ xvc pipeline step dependency --step-name train-model --file data/validate/classes.npy
Before running the pipeline, let's see the pipeline DAG once more. This time in DOT format.
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="recheck-data";];n1[shape=box;label="create-train-array";];n2[shape=folder;label="data/train/*.jpg";];n2->n1;n3[shape=box;label="install-requirements";];n3->n1;n4[shape=note;color=black;label="data/train/images.npy";];n1->n4;n5[shape=note;color=black;label="data/train/classes.npy";];n1->n5;n6[shape=box;label="create-test-array";];n7[shape=folder;label="data/test/*.jpg";];n7->n6;n3[shape=box;label="install-requirements";];n3->n6;n8[shape=note;color=black;label="data/test/images.npy";];n6->n8;n9[shape=note;color=black;label="data/test/classes.npy";];n6->n9;n10[shape=box;label="create-validate-array";];n11[shape=folder;label="data/validate/*.jpg";];n11->n10;n3[shape=box;label="install-requirements";];n3->n10;n12[shape=note;color=black;label="data/validate/images.npy";];n10->n12;n13[shape=note;color=black;label="data/validate/classes.npy";];n10->n13;n14[shape=box;label="init-venv";];n15[shape=trapezium;label="echo /"$(hostname)/$(pwd)/"";];n15->n14;n3[shape=box;label="install-requirements";];n14[shape=box;label="init-venv";];n14->n3;n16[shape=note;label="requirements.txt";];n16->n3;n17[shape=box;label="train-model";];n4[shape=note;label="data/train/images.npy";];n4->n17;n5[shape=note;label="data/train/classes.npy";];n5->n17;n8[shape=note;label="data/test/images.npy";];n8->n17;n9[shape=note;label="data/test/classes.npy";];n9->n17;n12[shape=note;label="data/validate/images.npy";];n12->n17;n13[shape=note;label="data/validate/classes.npy";];n13->n17;}
It's not the most readable graph description, but you can feed the output to the dot command to create an SVG file.
$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline1.svg'
Note that we forgot to create a params.yaml file containing the hyperparameters. When a step in the pipeline doesn't run successfully, its dependent steps won't be run. Let's add a params.yaml file and add it as a dependency to the train step.
$ zsh -cl 'echo "batch_size: 4" > params.yaml'
$ zsh -cl 'echo "epochs: 2" >> params.yaml'
$ xvc pipeline step dependency --step-name train-model --param params.yaml::batch_size
$ xvc pipeline step dependency --step-name train-model --param params.yaml::epochs
With the above commands, the step depends directly on these values. Even if the file contains other values, changing them won't invalidate the train-model step.
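To make the idea of depending on a single value more concrete, here is a minimal Python sketch. It assumes a flat "key: value" file and uses SHA-256 as a stand-in digest; the helper names and the learning_rate key are hypothetical, and Xvc's own parsing and digest calculation may differ.

```python
import hashlib


def read_flat_params(text):
    """Parse flat `key: value` lines; a stand-in for a real YAML parser."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def param_digest(text, key):
    """Digest of one parameter's value; other keys in the file don't affect it."""
    value = read_flat_params(text)[key]
    return hashlib.sha256(f"{key}={value}".encode()).hexdigest()


before = "batch_size: 4\nepochs: 2\n"
# learning_rate is a hypothetical extra key the step does not depend on.
unrelated_change = "batch_size: 4\nepochs: 2\nlearning_rate: 0.01\n"
relevant_change = "batch_size: 8\nepochs: 2\n"

same = param_digest(before, "batch_size") == param_digest(unrelated_change, "batch_size")
differs = param_digest(before, "batch_size") != param_digest(relevant_change, "batch_size")
print("unrelated key keeps digest:", same)
print("changed value alters digest:", differs)
```

Comparing per-key digests instead of the whole file is what lets unrelated edits to params.yaml leave the step valid.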
We can also specify the model and the results as outputs, and the graph will show them.
$ xvc pipeline step output --step-name train-model --output-file model.pth
$ xvc pipeline step output --step-name train-model --output-metric results.json
Let's see the pipeline DAG once more:
$ zsh -cl 'xvc pipeline dag | dot -Tsvg > pipeline2.svg'
We're ready to run the pipeline and train the model.
$ xvc -vv pipeline run
[INFO] Found explicit dependency: XvcStep { name: "create-test-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-train-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "create-validate-array" } -> Step(StepDep { name: "install-requirements" })
[INFO] Found explicit dependency: XvcStep { name: "install-requirements" } -> Step(StepDep { name: "init-venv" })
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-test-array" } (via XvcPath("data/test/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-train-array" } (via XvcPath("data/train/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/images.npy"))
[INFO][pipeline/src/pipeline/mod.rs::151] Found implicit dependency: XvcStep { name: "train-model" } -> XvcStep { name: "create-validate-array" } (via XvcPath("data/validate/classes.npy"))
[INFO][pipeline/src/pipeline/mod.rs::343] Pipeline Graph:
digraph {
0 [ label = "(30024, 14850552671149047786)" ]
1 [ label = "(30009, 11376621678660215310)" ]
2 [ label = "(30011, 9338166212381570306)" ]
3 [ label = "(30010, 8484021102039729264)" ]
4 [ label = "(30012, 12907533602545881359)" ]
5 [ label = "(30016, 17450406389616117859)" ]
6 [ label = "(30018, 2681008057348839262)" ]
2 -> 6 [ label = "Step" ]
3 -> 6 [ label = "Step" ]
4 -> 6 [ label = "Step" ]
6 -> 5 [ label = "Step" ]
0 -> 2 [ label = "File" ]
0 -> 3 [ label = "File" ]
0 -> 4 [ label = "File" ]
}
[INFO] No dependency steps for step init-venv
[INFO] Waiting for dependency steps for step create-validate-array
[INFO] Waiting for dependency steps for step train-model
[INFO] No dependency steps for step recheck-data
[INFO] [recheck-data] Dependencies has changed
[INFO] Waiting for dependency steps for step install-requirements
[INFO] Waiting for dependency steps for step create-train-array
[INFO] Waiting for dependency steps for step create-test-array
[INFO] [init-venv] No changed dependencies. Skipping thorough comparison.
[INFO] [init-venv] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step install-requirements
[INFO] [install-requirements] No changed dependencies. Skipping thorough comparison.
[INFO] [install-requirements] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step create-train-array
[INFO] Dependency steps completed successfully for step create-test-array
[INFO] Dependency steps completed successfully for step create-validate-array
[INFO] [create-test-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-test-array] No missing Outputs and no changed dependencies
[INFO] [create-validate-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-validate-array] No missing Outputs and no changed dependencies
[INFO] [create-train-array] No changed dependencies. Skipping thorough comparison.
[INFO] [create-train-array] No missing Outputs and no changed dependencies
[INFO] Dependency steps completed successfully for step train-model
[DONE] recheck-data (xvc file recheck data/train/ data/validate/ data/test/)
[INFO] [train-model] Dependencies has changed
[OUT] [train-model] [1, 2000] loss: 0.921
Accuracy of the network on the validation images: 72 %
[2, 2000] loss: 0.426
Accuracy of the network on the validation images: 83 %
Confusion Matrix:
[[174 0 0 1 2 0 1 2 0 2 0 14 0 1 3]
[ 1 132 60 0 0 0 1 0 0 0 0 5 1 0 0]
[ 3 1 157 34 0 0 3 0 0 0 1 1 0 0 0]
[ 2 0 34 160 0 2 2 0 0 0 0 0 0 0 0]
[ 1 0 0 0 186 0 0 1 0 2 0 9 0 0 1]
[ 3 0 11 12 0 145 1 0 0 9 1 12 3 2 1]
[ 3 1 1 0 1 0 133 8 16 9 6 10 2 10 0]
[ 0 0 0 0 3 1 5 145 3 8 25 2 1 1 6]
[ 0 0 0 0 0 0 1 1 181 4 1 1 0 4 7]
[ 2 0 0 0 2 1 0 3 7 142 4 3 0 7 29]
[ 0 0 0 0 1 0 1 0 0 1 193 2 2 0 0]
[ 4 0 0 0 21 4 0 5 1 1 4 152 1 4 3]
[ 0 1 1 1 0 1 3 1 0 0 55 4 132 0 1]
[ 5 0 0 0 2 0 0 2 0 0 1 36 0 153 1]
[ 0 0 0 0 8 0 0 1 2 5 0 0 0 7 177]]
[DONE] train-model (.venv/bin/python3 train.py --train_dir data/train/ --val_dir data/validate --test_dir data/test)
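If you'd like a single number out of the confusion matrix printed above, the overall accuracy is the sum of the diagonal divided by the sum of all entries. The sketch below copies the matrix from the log; the variable names are ours and are not part of train.py.

```python
# Confusion matrix copied from the train-model output above.
confusion = [
    [174, 0, 0, 1, 2, 0, 1, 2, 0, 2, 0, 14, 0, 1, 3],
    [1, 132, 60, 0, 0, 0, 1, 0, 0, 0, 0, 5, 1, 0, 0],
    [3, 1, 157, 34, 0, 0, 3, 0, 0, 0, 1, 1, 0, 0, 0],
    [2, 0, 34, 160, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 186, 0, 0, 1, 0, 2, 0, 9, 0, 0, 1],
    [3, 0, 11, 12, 0, 145, 1, 0, 0, 9, 1, 12, 3, 2, 1],
    [3, 1, 1, 0, 1, 0, 133, 8, 16, 9, 6, 10, 2, 10, 0],
    [0, 0, 0, 0, 3, 1, 5, 145, 3, 8, 25, 2, 1, 1, 6],
    [0, 0, 0, 0, 0, 0, 1, 1, 181, 4, 1, 1, 0, 4, 7],
    [2, 0, 0, 0, 2, 1, 0, 3, 7, 142, 4, 3, 0, 7, 29],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 193, 2, 2, 0, 0],
    [4, 0, 0, 0, 21, 4, 0, 5, 1, 1, 4, 152, 1, 4, 3],
    [0, 1, 1, 1, 0, 1, 3, 1, 0, 0, 55, 4, 132, 0, 1],
    [5, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 36, 0, 153, 1],
    [0, 0, 0, 0, 8, 0, 0, 1, 2, 5, 0, 0, 0, 7, 177],
]

# Correct predictions sit on the diagonal, whichever axis holds the true class.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.3f}")  # prints 2362/3000 = 0.787
```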
We now have a model and a result file. Let's track them with Xvc as well.
$ xvc file track model.pth results.json
Sharing Data and Models
graph LR A[Data Gathering ✅] --> B[Splitting Test and Train Sets ✅] B --> C[Preprocessing Images into Numpy Arrays ✅] C --> D[Training Model ✅] D --> E[Sharing Data and Models]
Sharing a machine learning project with Xvc means sharing the Git repository together with the data and model files that Xvc tracks in it. For the former, we can use any kind of Git remote, e.g. GitHub. Xvc doesn't require any special setup (like Git-LFS) to share binary files.
In order to share the binary files, we need to specify an Xvc storage. This can be a local folder, an SSH host with rsync, an AWS S3 bucket or any of the other supported storage backends. (See the xvc storage new documentation for the full list.)
In this example, we'll create a new S3 bucket and share all files there.
$ xvc storage new s3 --name my-s3 --bucket-name xvc-test --region eu-central-1 --storage-prefix how-to-create-a-pipeline
$ xvc file send --remote my-s3
These two commands define a new remote storage and send all files to it. When you want to share the pipeline together with all the code and data it runs on, others can clone the repository and run the following commands to get the files. Don't forget to push the most recent version of your repository.
$ git push
# On another machine
$ git clone git@github.com:my-user/my-ml-pipeline
$ xvc file bring
Note that the second time there is no need to configure the remote storage, but the user must have AWS credentials in their environment. You can also automate this on GitHub and train your pipelines on CI.
In this how-to we created an end-to-end machine learning pipeline. Please ask about any issues that are not clear in the comment box below. Thank you for reading so far.
Command Reference
Synopsis
$ xvc --help
Xvc CLI to manage data and ML pipelines
Usage: xvc [OPTIONS] <COMMAND>
Commands:
file File and directory management commands
init Initialize an Xvc project
pipeline Pipeline management commands
storage Storage (cloud) management commands
root Find the root directory of a project
check-ignore Check whether files are ignored with `.xvcignore`
aliases Print command aliases to be sourced in shell files
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Output verbosity. Use multiple times to increase the output detail
--quiet Suppress all output
--debug Turn on all logging to $TMPDIR/xvc.log
-C <WORKDIR> Set working directory for the command. It doesn't create a new shell, or change the directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value You can use multiple times
--no-system-config Ignore system configuration file
--no-user-config Ignore user configuration file
--no-project-config Ignore project configuration file (.xvc/config)
--no-local-config Ignore local (gitignored) configuration file (.xvc/config.local)
--no-env-config Ignore configuration options obtained from environment variables
--skip-git Don't run automated Git operations for this command. If you want to run git commands yourself all the time, you can set `git.auto_commit` and `git.auto_stage` options in the configuration to False
--from-ref <FROM_REF> Checkout the given Git reference (branch, tag, commit etc.) before performing the Xvc operation. This runs `git checkout <given-value>` before running the command
--to-branch <TO_BRANCH> If given, create (or checkout) the given branch before committing results of the operation. This runs `git checkout --branch <given-value>` before committing the changes
-h, --help Print help
-V, --version Print version
Subcommands
- file: File and directory management commands
- init: Initialize an Xvc project
- pipeline: Pipeline management commands
- storage: Storage (cloud) management commands
- root: Find the root directory of a project
- check-ignore: Check whether files are ignored with .xvcignore
- aliases: Print command aliases to be sourced in shell files
xvc init
Synopsis
$ xvc init --help
Initialize an Xvc project
Usage: xvc init [OPTIONS]
Options:
--path <PATH> Path to the directory to be intialized. (default: current directory)
--no-git Don't require Git
--force Create the repository even if already initialized. Overwrites the current .xvc directory Resets all data and guid, etc
-h, --help Print help
-V, --version Print version
Examples
To initialize a blank Xvc repository, initialize Git first and run xvc init.
$ cd my-project-1
$ git init
...
$ xvc init
? 0
The command doesn't print anything upon success.
If you want to initialize
File Management
Synopsis
$ xvc file --help
File and directory management commands
Usage: xvc file [OPTIONS] <COMMAND>
Commands:
track Add file and directories to Xvc
hash Get digest hash of files with the supported algorithms
recheck Get files from cache by copy or *link
carry-in Carry (commit) changed files to cache
copy Copy from source to another location in the workspace
move Move files to another location in the workspace
list List tracked and untracked elements in the workspace
send Send (push, upload) files to external storages
bring Bring (download, pull, fetch) files from external storages
remove Remove files from Xvc and possibly storages
untrack Untrack (delete) files from Xvc and possibly storages
share Share a file from S3 compatible storage for a limited time
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Verbosity level. Use multiple times to increase command output detail
--quiet Suppress error messages
-C <WORKDIR> Set the working directory to run the command as if it's in that directory [default: .]
-c, --config <CONFIG> Configuration options set from the command line in the form section.key=value
--no-system-config Ignore system config file
--no-user-config Ignore user config file
--no-project-config Ignore project config (.xvc/config)
--no-local-config Ignore local config (.xvc/config.local)
--no-env-config Ignore configuration options from the environment
-h, --help Print help
-V, --version Print version
Subcommands
- track: Track (add) files with Xvc
- recheck: Copy/link files in the cache to the workspace (checkout)
- carry-in: Carry-in (commit) changed files to cache
- copy: Copy files to another location in the workspace
- move: Move files to another location in the workspace
- list: List tracked files
- send: Send (push) files to storage
- bring: Bring (pull) files from storage
- hash: Calculate hashes with supported algorithms similar to sha256sum, blake2sum, etc.
- remove: Remove files from Xvc cache or storages
- untrack: Untrack (delete) files from Xvc
xvc file track
Purpose
xvc file track is used to register any kind of file to Xvc for tracking versions.
Synopsis
$ xvc file track --help
Add file and directories to Xvc
Usage: xvc file track [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to track
Options:
--recheck-method <RECHECK_METHOD>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--no-commit
Do not copy/link added files to the file cache
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
--force
Add targets even if they are already tracked
--no-parallel
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
Examples
File tracking works only in Xvc repositories.
$ git init
...
$ xvc init
Let's create a directory tree for these examples.
$ xvc-test-helper create-directory-tree --directories 4 --files 3 --seed 20231021
$ tree
.
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
├── dir-0002
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
├── dir-0003
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0004
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
5 directories, 12 files
By default, the command runs similarly to git add and git commit.
You can track individual files.
$ xvc file track dir-0001/file-0001.bin
You can track directories with the same command.
$ xvc file track dir-0002/
You can specify more than one target in a single command.
$ xvc file track dir-0001/file-0002.bin dir-0001/file-0003.bin
When you track a file, Xvc moves the file to the cache directory under .xvc/ and connects the workspace file with the cached file. This connection is called rechecking and is analogous to Git checkout. For example, the above commands create a directory tree under .xvc as follows:
$ tree .xvc/b3
.xvc/b3
├── 493
│ └── eeb
│ └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│ └── 0.bin
├── ab3
│ └── 619
│ └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│ └── 0.bin
└── e51
└── 7d6
└── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
└── 0.bin
10 directories, 3 files
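The tree above suggests how a cache path is laid out: an algorithm prefix (b3), the first three hex characters of the digest, the next three, the remaining characters, and finally a file named 0 with the original extension. The Python sketch below reproduces that layout as we read it from the listing; the function name cache_path is hypothetical and the actual implementation in Xvc may differ.

```python
from pathlib import PurePosixPath


def cache_path(digest_hex, extension, algorithm="b3"):
    """Build a cache path like the ones listed under .xvc/b3 above."""
    if len(digest_hex) != 64:
        raise ValueError("expected a 64-character hexadecimal digest")
    return PurePosixPath(
        ".xvc",
        algorithm,
        digest_hex[:3],    # first directory level
        digest_hex[3:6],   # second directory level
        digest_hex[6:],    # remaining 58 characters
        f"0.{extension}",  # cached content keeps the original extension
    )


# Digest assembled from one of the directory names in the tree output.
path = cache_path(
    "493eeb6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79", "bin"
)
print(path)
```

Since the path is derived only from the content digest, two files with the same content end up at the same cache location.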
Xvc can connect the workspace file to the cache with different recheck (checkout) methods. The default method is copying the file to the workspace. This way a separate copy of the cache file is created in the workspace.
If you want to make this connection with symbolic links, you can specify it with the --recheck-method option.
$ xvc file track --recheck-method symlink dir-0003/file-0001.bin
$ ls -l dir-0003/file-0001.bin
lrwxr-xr-x[..] dir-0003/file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
You can also pass hardlink and reflink to --recheck-method. Please see the xvc file recheck reference for details.
$ xvc file track --recheck-method hardlink dir-0003/file-0002.bin
$ xvc file track --recheck-method reflink dir-0003/file-0003.bin
$ ls -l dir-0003/
total 16
l[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
-[..] file-0002.bin
-[..] file-0003.bin
Note that, unlike DVC, which specifies the checkout/recheck option repository-wide, Xvc lets you specify it per file. You can recheck data files as symbolic links (which are non-writable) to save space, and keep model files as copies of the cached original so that you can commit (carry-in) them every time they change.
When you track a file in Xvc, it's automatically committed (carried in) to the cache directory. If you want to postpone this operation and don't need a cached copy of a file yet, you can use the --no-commit option. You can later use the xvc file carry-in command to move these files to the repository cache.
$ xvc file track --no-commit --recheck-method symlink dir-0004/
$ ls -l dir-0004/
total 24
-rw-r--r--[..] file-0001.bin
-rw-r--r--[..] file-0002.bin
-rw-r--r--[..] file-0003.bin
$ xvc file list dir-0004/
FS [..] ab361981 ab361981 dir-0004/file-0003.bin
FS [..] 493eeb65 493eeb65 dir-0004/file-0002.bin
FS [..] e517d6b9 e517d6b9 dir-0004/file-0001.bin
Total #: 3 Workspace Size: 6006 Cached Size: 6006
You can carry-in (commit) these files to the cache with the xvc file carry-in command. Note that, as the files are deduplicated, we need to use --force in the carry-in command. This behavior may change in the future.
$ xvc file carry-in --force dir-0004/
$ ls -l dir-0004/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/e51/7d6/b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/493/eeb/6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/ab3/619/814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0/0.bin
Xvc deduplicates files in the cache. If you track a file whose content is already in the cache, it won't be moved to the cache again. Instead, the workspace file is copied or linked from the existing cached copy.
$ tree .xvc/b3
.xvc/b3
├── 493
│ └── eeb
│ └── 6525ea5e94e1e760371108e4a525c696c773a774a4818e941fd6d1af79
│ └── 0.bin
├── ab3
│ └── 619
│ └── 814cae0456a5a291e4d5c8d339a8389630e476f9f9e8d3a09accc919f0
│ └── 0.bin
└── e51
└── 7d6
└── b9a3617fdcd96bd128142a39f1eca26ed77a338d2b93ba4921a0116c70
└── 0.bin
10 directories, 3 files
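The cache holds only three files although twelve workspace files were tracked, because files with identical content share one cache entry. A toy Python model of this behavior is sketched below; ContentStore is a hypothetical class, and SHA-256 from the standard library stands in for the digest Xvc actually uses.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store; SHA-256 stands in for Xvc's digest."""

    def __init__(self):
        self.blobs = {}  # digest -> content, one entry per distinct content
        self.paths = {}  # workspace path -> digest

    def track(self, path, content):
        digest = hashlib.sha256(content).hexdigest()
        # Store the content only if this digest hasn't been seen before.
        self.blobs.setdefault(digest, content)
        self.paths[path] = digest
        return digest


store = ContentStore()
store.track("dir-0001/file-0001.bin", b"same bytes")
store.track("dir-0004/file-0001.bin", b"same bytes")
store.track("dir-0001/file-0002.bin", b"other bytes")
print(len(store.paths), "paths,", len(store.blobs), "cached blobs")
```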
Caveats
- This command doesn't discriminate symbolic links or hardlinks. Links are followed and any broken links may cause errors.
- Under the hood, Xvc tracks only the files, not directories. Directories are considered as path collections. It doesn't matter if you track a directory or files in it separately.
Technical Details
- Detecting changes in files and directories employs different kinds of associated digests. If a file has a different metadata digest, its content digest is calculated. If the file's content digest has changed, the file is considered changed. A directory that contains a different set of files, or files with changed content, is considered changed.
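The two-stage check described above can be sketched in Python as follows. This is an illustration under assumptions, not Xvc's code: we take size and modification time as the metadata, use SHA-256 for both digests, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def metadata_digest(path):
    """Cheap digest from file size and modification time (assumed metadata)."""
    st = os.stat(path)
    return hashlib.sha256(f"{st.st_size}:{st.st_mtime_ns}".encode()).hexdigest()


def content_digest(path):
    """Expensive digest that reads the whole file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def has_changed(path, recorded):
    """Compare content only when the metadata digest differs."""
    if metadata_digest(path) == recorded["metadata"]:
        return False  # cheap check passed, no need to read the file
    return content_digest(path) != recorded["content"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("first version")
    recorded = {"metadata": metadata_digest(path), "content": content_digest(path)}

    unchanged = has_changed(path, recorded)
    os.utime(path, ns=(1, 1))  # new timestamp, identical content
    touched = has_changed(path, recorded)
    with open(path, "w") as f:
        f.write("second version, longer")
    edited = has_changed(path, recorded)

print(unchanged, touched, edited)
```

A file whose timestamp changed but whose content didn't is still reported as unchanged, which is the difference from timestamp-based tools.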
xvc file untrack
Synopsis
$ xvc file untrack --help
Untrack (delete) files from Xvc and possibly storages
Usage: xvc file untrack [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]... Files/directories to untrack
Options:
--restore-versions <RESTORE_VERSIONS>
Restore all versions to a directory before deleting the cache files
-h, --help
Print help
Examples
This command removes a file from Xvc tracking and optionally deletes it from the local filesystem, cache, and the storages.
It only works if the file is tracked by Xvc.
$ git init
...
$ xvc init
$ xvc file track 'd*.txt'
$ xvc file list
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
Without any options, it removes the file from Xvc tracking and the cache.
xvc file untrack doesn't modify the .gitignore files to remove the previously tracked files. You must do it manually if you want to track the file with Git.
$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3
$ git status
On branch [..]
nothing to commit, working tree clean
If you have rechecked the file as symlink or reflink, it will be copied to the workspace.
$ xvc file track data.txt --as symlink
$ lsd -l
lrwxr-xr-x [..] data.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
$ xvc file untrack data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3
$ lsd -l
.rw-rw-rw- [..] data.txt
If there are multiple versions of the file, it removes them all and restores the latest version.
If you want to restore all versions of the file, you can specify a directory to restore them.
$ xvc file track data.txt
$ perl -pi -e 's/a/e/g' data.txt
$ xvc file carry-in data.txt
$ xvc file untrack data.txt --restore-versions data-versions/
[COPY] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt -> [CWD]/data-versions/data-b3-660-2cf-f6a4.txt
[COPY] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt -> [CWD]/data-versions/data-b3-c85-f3e-8108.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3
$ lsd -l data-versions/
.r--r--r-- [..] data-b3-660-2cf-f6a4.txt
.r--r--r-- [..] data-b3-c85-f3e-8108.txt
If multiple paths are pointing to the same cache file (with deduplication), the cache file will not be deleted. In this case, untrack reports other paths pointing to the same cache file. You must untrack all of them to delete the cache file.
$ xvc file track data.txt
$ xvc file copy data.txt data2.txt --as symlink
$ xvc file untrack data.txt
Not deleting b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt (for data.txt) because it's also used by data2.txt
$ tree .xvc/b3/
.xvc/b3/
└── 660
└── 2cf
└── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
└── 0.txt
4 directories, 1 file
$ xvc file untrack data2.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3
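The behavior in the last example, where a shared cache file survives until its last referencing path is untracked, can be modeled with a small Python sketch. The dictionaries, the shortened digest and the untrack function below are hypothetical and only mimic the messages shown above.

```python
def untrack(path, path_to_digest, cache):
    """Remove `path`; delete its cache entry only if no other path uses it."""
    digest = path_to_digest.pop(path)
    others = sorted(p for p, d in path_to_digest.items() if d == digest)
    if others:
        return (
            f"Not deleting {digest} (for {path}) "
            f"because it's also used by {', '.join(others)}"
        )
    del cache[digest]
    return f"[DELETE] {digest}"


# Two workspace paths share one cached content, as after `xvc file copy`.
path_to_digest = {"data.txt": "6602cff6", "data2.txt": "6602cff6"}
cache = {"6602cff6": b"cached content"}

first = untrack("data.txt", path_to_digest, cache)
still_cached = "6602cff6" in cache
second = untrack("data2.txt", path_to_digest, cache)
print(first)
print(second)
```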
xvc file list
Synopsis
$ xvc file list --help
List tracked and untracked elements in the workspace
Usage: xvc file list [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to list.
If not supplied, lists all files under the current directory.
Options:
-f, --format <FORMAT>
A string for each row of the output table
The following are the keys for each row:
- {{acd8}}: actual content digest from the workspace file. First 8 digits.
- {{acd64}}: actual content digest. All 64 digits.
- {{aft}}: actual file type. Whether the entry is a file (F), directory (D),
symlink (S), hardlink (H) or reflink (R).
- {{asz}}: actual size. The size of the workspace file in bytes. It uses MB,
GB and TB to represent sizes larger than 1MB.
- {{ats}}: actual timestamp. The timestamp of the workspace file.
- {{name}}: The name of the file or directory.
- {{cst}}: cache status. One of "=", ">", "<", "X", or "?" to show
whether the file timestamp is the same as the cached timestamp, newer,
older, not cached or not tracked.
- {{rcd8}}: recorded content digest stored in the cache. First 8 digits.
- {{rcd64}}: recorded content digest stored in the cache. All 64 digits.
- {{rrm}}: recorded recheck method. Whether the entry is linked to the workspace
as a copy (C), symlink (S), hardlink (H) or reflink (R).
- {{rsz}}: recorded size. The size of the cached content in bytes. It uses
MB, GB and TB to represent sizes larged than 1MB.
- {{rts}}: recorded timestamp. The timestamp of the cached content.
The default format can be set with file.list.format in the config file.
-s, --sort <SORT>
Sort criteria.
It can be one of none (default), name-asc, name-desc, size-asc, size-desc, ts-asc, ts-desc.
The default option can be set with file.list.sort in the config file.
--no-summary
Don't show total number and size of the listed files.
The default option can be set with file.list.no_summary in the config file.
-a, --show-dot-files
Don't hide dot files
If not supplied, hides dot files like .gitignore and .xvcignore
-h, --help
Print help (see a summary with '-h')
Examples
For these examples, we'll create a directory tree with five directories, each containing five files.
$ xvc-test-helper create-directory-tree --directories 5 --files 5 --seed 20230213
$ tree
.
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0002
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0003
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
├── dir-0004
│ ├── file-0001.bin
│ ├── file-0002.bin
│ ├── file-0003.bin
│ ├── file-0004.bin
│ └── file-0005.bin
└── dir-0005
├── file-0001.bin
├── file-0002.bin
├── file-0003.bin
├── file-0004.bin
└── file-0005.bin
[..] directories, 25 files
The xvc file list command works only in Xvc repositories. As we haven't initialized a repository yet, it reports an error.
$ xvc file list
? 1
[ERROR] File Error: [E2004] Requires xvc repository.
Error: FileError { source: RequiresXvcRepository }
Let's initialize the repository.
$ git init
...
$ xvc init
Now it lists all files and directories.
$ xvc file list --sort name-asc
DX 224 [..] dir-0001
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2005 [..] 447933dc dir-0001/file-0005.bin
DX 224 [..] dir-0002
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2005 [..] 447933dc dir-0002/file-0005.bin
DX 224 [..] dir-0003
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2005 [..] 447933dc dir-0003/file-0005.bin
DX 224 [..] dir-0004
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2005 [..] 447933dc dir-0004/file-0005.bin
DX 224 [..] dir-0005
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2005 [..] 447933dc dir-0005/file-0005.bin
Total #: 30 Workspace Size: 51195 Cached Size: 0
By default, the command hides dotfiles. To show them as well, use the --show-dot-files (-a) flag.
$ xvc file list --sort name-asc --show-dot-files
FX [..] [..] [..] .gitignore
FX [..] [..] [..] .xvcignore
DX 224 [..] dir-0001
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2005 [..] 447933dc dir-0001/file-0005.bin
DX 224 [..] dir-0002
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2005 [..] 447933dc dir-0002/file-0005.bin
DX 224 [..] dir-0003
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2005 [..] 447933dc dir-0003/file-0005.bin
DX 224 [..] dir-0004
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2005 [..] 447933dc dir-0004/file-0005.bin
DX 224 [..] dir-0005
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2005 [..] 447933dc dir-0005/file-0005.bin
Total #: 32 Workspace Size: 51443 Cached Size: 0
You can also hide the summary below the list to get only the list of files.
$ xvc file list --sort name-asc --no-summary
DX 224 [..] dir-0001
FX 2001 [..] 1953f05d dir-0001/file-0001.bin
FX 2002 [..] 7e807161 dir-0001/file-0002.bin
FX 2003 [..] d2432259 dir-0001/file-0003.bin
FX 2004 [..] 63535612 dir-0001/file-0004.bin
FX 2005 [..] 447933dc dir-0001/file-0005.bin
DX 224 [..] dir-0002
FX 2001 [..] 1953f05d dir-0002/file-0001.bin
FX 2002 [..] 7e807161 dir-0002/file-0002.bin
FX 2003 [..] d2432259 dir-0002/file-0003.bin
FX 2004 [..] 63535612 dir-0002/file-0004.bin
FX 2005 [..] 447933dc dir-0002/file-0005.bin
DX 224 [..] dir-0003
FX 2001 [..] 1953f05d dir-0003/file-0001.bin
FX 2002 [..] 7e807161 dir-0003/file-0002.bin
FX 2003 [..] d2432259 dir-0003/file-0003.bin
FX 2004 [..] 63535612 dir-0003/file-0004.bin
FX 2005 [..] 447933dc dir-0003/file-0005.bin
DX 224 [..] dir-0004
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2003 [..] d2432259 dir-0004/file-0003.bin
FX 2004 [..] 63535612 dir-0004/file-0004.bin
FX 2005 [..] 447933dc dir-0004/file-0005.bin
DX 224 [..] dir-0005
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2003 [..] d2432259 dir-0005/file-0003.bin
FX 2004 [..] 63535612 dir-0005/file-0004.bin
FX 2005 [..] 447933dc dir-0005/file-0005.bin
Output Format
With the default output format, the first two letters show the path type and recheck method, respectively.
For example, if you track dir-0001 as copy, the first letter is F for files and D for directories. The second letter is C for files, meaning the file is a copy of the cached file, and X for directories, meaning they are not in the cache. Similar to Git, Xvc tracks only files; directories are considered collections of files.
$ xvc file track dir-0001/
$ xvc file list dir-0001/
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
If you track another set of files as hardlinks to the cached copies, the second letter is printed as H.
$ xvc file track dir-0002/ --recheck-method hardlink
$ xvc file list dir-0002
FH 2005 [..] 447933dc 447933dc dir-0002/file-0005.bin
FH 2004 [..] 63535612 63535612 dir-0002/file-0004.bin
FH 2003 [..] d2432259 d2432259 dir-0002/file-0003.bin
FH 2002 [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH 2001 [..] 1953f05d 1953f05d dir-0002/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
Note that hardlinks are detected as F, because they are regular files that share an inode in the file system under alternative paths.
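The reason a hardlink shows up as F can be seen with a small Python sketch. It is independent of Xvc: the file names and the path_type helper are hypothetical, and it only illustrates that a hardlink is a regular file sharing an inode with another path.

```python
import os
import stat
import tempfile


def path_type(path):
    """Classify a path without following symlinks: 'S', 'D' or 'F'."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return "S"
    if stat.S_ISDIR(mode):
        return "D"
    return "F"


def demo():
    """Create a file and a hardlink to it, then inspect the link."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached.bin")        # hypothetical cache file
        linked = os.path.join(tmp, "workspace.bin")     # hypothetical workspace path
        with open(cached, "wb") as f:
            f.write(b"example")
        os.link(cached, linked)  # second name for the same inode
        same_inode = os.lstat(cached).st_ino == os.lstat(linked).st_ino
        return path_type(linked), same_inode, os.lstat(linked).st_nlink
```

Nothing in the file mode distinguishes the two names, so a tool that looks only at the path type has no way to tell a hardlink from an ordinary file.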
Symbolic links are typically reported as SS in the first two letters. This means they are symbolic links on the file system and their recheck method is also symlink.
$ xvc file track dir-0003 --recheck-method symlink
$ xvc file list dir-0003
SS [..] 447933dc dir-0003/file-0005.bin
SS [..] 63535612 dir-0003/file-0004.bin
SS [..] d2432259 dir-0003/file-0003.bin
SS [..] 7e807161 dir-0003/file-0002.bin
SS [..] 1953f05d dir-0003/file-0001.bin
Total #: 5 Workspace Size: [..] Cached Size: 10015
R represents reflinks, although not all file systems support them.
Globs
You may use globs to list files.
$ xvc file list 'dir-*/*-0001.bin'
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
SS [..] 1953f05d dir-0003/file-0001.bin
FH 2[..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC 2[..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: [..] Cached Size: 2001
Note that all these files are identical. They are cached once, and only one of them takes space in the cache.
You can also use multiple targets as globs.
$ xvc file list '*/*-0001.bin' '*/*-0002.bin'
FX 2002 [..] 7e807161 dir-0005/file-0002.bin
FX 2001 [..] 1953f05d dir-0005/file-0001.bin
FX 2002 [..] 7e807161 dir-0004/file-0002.bin
FX 2001 [..] 1953f05d dir-0004/file-0001.bin
SS [..] 7e807161 dir-0003/file-0002.bin
SS [..] 1953f05d dir-0003/file-0001.bin
FH [..] 7e807161 7e807161 dir-0002/file-0002.bin
FH [..] 1953f05d 1953f05d dir-0002/file-0001.bin
FC [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 10 Workspace Size: [..] Cached Size: 4003
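The Cached Size values in the two listings above are consistent with counting each distinct content only once. The following Python sketch reproduces that arithmetic. The cached_size function is hypothetical, and the sizes of the tracked files are taken from the earlier untracked listing.

```python
def cached_size(entries):
    """Sum sizes once per distinct digest; entries are (digest, size) pairs."""
    unique = {}
    for digest, size in entries:
        unique.setdefault(digest, size)
    return sum(unique.values())


# Tracked rows matched by the two globs: dir-0001, dir-0002 and dir-0003.
# The untracked rows (FX) are not in the cache, so they are left out.
tracked = [
    ("7e807161", 2002), ("1953f05d", 2001),  # dir-0003 (symlink)
    ("7e807161", 2002), ("1953f05d", 2001),  # dir-0002 (hardlink)
    ("7e807161", 2002), ("1953f05d", 2001),  # dir-0001 (copy)
]
print(cached_size(tracked))  # 4003
```

Six tracked paths point to only two distinct contents, so the cache holds 2001 + 2002 bytes.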
Sorting
You may sort the xvc file list output by name, by modification time and by file size. Use the --sort option to specify the sort criteria.
$ xvc file list --sort name-desc dir-0001/
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
$ xvc file list --sort name-asc dir-0001/
FC 2001 [..] 1953f05d 1953f05d dir-0001/file-0001.bin
FC 2002 [..] 7e807161 7e807161 dir-0001/file-0002.bin
FC 2003 [..] d2432259 d2432259 dir-0001/file-0003.bin
FC 2004 [..] 63535612 63535612 dir-0001/file-0004.bin
FC 2005 [..] 447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
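If you post-process the listing in a script, the same criteria names can be mapped to a sort key. This is a minimal Python sketch, not Xvc's implementation. The row dictionaries and the sort_rows function are hypothetical.

```python
def sort_rows(rows, criteria="none"):
    """Sort row dicts by a '<field>-<asc|desc>' criteria string."""
    if criteria == "none":
        return list(rows)  # keep the incoming order
    field, _, direction = criteria.rpartition("-")
    keys = {"name": "name", "size": "size", "ts": "timestamp"}
    if field not in keys or direction not in ("asc", "desc"):
        raise ValueError(f"unknown sort criteria: {criteria}")
    return sorted(rows, key=lambda row: row[keys[field]],
                  reverse=(direction == "desc"))


rows = [
    {"name": "dir-0001/file-0002.bin", "size": 2002, "timestamp": 20},
    {"name": "dir-0001/file-0001.bin", "size": 2001, "timestamp": 30},
    {"name": "dir-0001/file-0003.bin", "size": 2003, "timestamp": 10},
]
for row in sort_rows(rows, "name-asc"):
    print(row["size"], row["name"])
```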
Column Format
You can specify the columns that the command prints.
For example, if you only want to see the file names, use {{name}} as the format string.
The following command sorts the files in dir-0001 by their size in the workspace, and prints their size and name.
$ xvc file list --format '{{asz}} {{name}}' --sort size-desc dir-0001/
2005 dir-0001/file-0005.bin
2004 dir-0001/file-0004.bin
2003 dir-0001/file-0003.bin
2002 dir-0001/file-0002.bin
2001 dir-0001/file-0001.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
If you want to compare the recorded (cached) hashes and actual hashes in the workspace, you can use the {{acd8}} {{rcd8}} {{name}} format string.
$ xvc file list --format '{{acd8}} {{rcd8}} {{name}}' --sort ts-asc dir-0001
1953f05d 1953f05d dir-0001/file-0001.bin
7e807161 7e807161 dir-0001/file-0002.bin
d2432259 d2432259 dir-0001/file-0003.bin
63535612 63535612 dir-0001/file-0004.bin
447933dc 447933dc dir-0001/file-0005.bin
Total #: 5 Workspace Size: 10015 Cached Size: 10015
If neither {{acd8}} nor {{acd64}} is present in the format string, Xvc doesn't calculate these hashes. If you have a large number of files and the default format (which includes actual content hashes) runs slowly, you may customize the format to leave out these columns.
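The idea of skipping hash calculation when the format string doesn't ask for it can be sketched in Python as follows. This is an illustration under assumptions, not Xvc's code: render_row, the row dictionary and the SHA-256 stand-in digest are all hypothetical.

```python
import hashlib
import re

KEY = re.compile(r"\{\{(\w+)\}\}")


def sha256_hex(data):
    """Stand-in digest; the digests shown in this document come from Xvc itself."""
    return hashlib.sha256(data).hexdigest()


def render_row(fmt, row, digest_fn=sha256_hex):
    """Fill {{key}} placeholders; hash the content only if acd8/acd64 is requested."""
    wanted = set(KEY.findall(fmt))
    values = {
        "name": row["name"],
        "asz": str(len(row["content"])),
        "rcd64": row["recorded"],
        "rcd8": row["recorded"][:8],
    }
    if wanted & {"acd8", "acd64"}:
        actual = digest_fn(row["content"])  # the expensive step
        values["acd64"] = actual
        values["acd8"] = actual[:8]
    # Unknown keys are left as they are in this sketch.
    return KEY.sub(lambda m: values.get(m.group(1), m.group(0)), fmt)
```

With a format such as '{{asz}} {{name}}', digest_fn is never called, which is why leaving out the actual digest columns can make a listing of many large files faster.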
If you want to get a quick glimpse of what needs to be carried in or rechecked, you can use the cache status {{cst}} column.
$ xvc-test-helper generate-random-file --size 100 dir-0001/a-new-file.bin
$ xvc file list --format '{{cst}} {{name}}' dir-0001/
= dir-0001/file-0005.bin
= dir-0001/file-0004.bin
= dir-0001/file-0003.bin
= dir-0001/file-0002.bin
= dir-0001/file-0001.bin
X dir-0001/a-new-file.bin
Total #: 6 Workspace Size: 10115 Cached Size: 10015
The cache status column shows = for files that are unchanged in the cache, X for untracked files, > for files that have a newer version in the cache, and < for files that have a newer version in the workspace. The comparison is done between the recorded timestamp and the actual timestamp with an accuracy of 1 second.
xvc file hash
Synopsis
$ xvc file hash --help
Get digest hash of files with the supported algorithms
Usage: xvc file hash [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]... Files to process
Options:
-a, --algorithm <ALGORITHM>
Algorithm to calculate the hash. One of blake3, blake2, sha2, sha3. All algorithm variants produce 32-bytes digest
--text-or-binary <TEXT_OR_BINARY>
For "text" remove line endings before calculating the digest. Keep line endings if "binary". "auto" (default) detects the type by checking 0s in the first 8Kbytes, similar to Git [default: auto]
-h, --help
Print help
-V, --version
Print version
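The text-or-binary option described in the help text can be approximated with the Python standard library. This is a rough sketch under several assumptions: blake3 is left out because it is not in the standard library, blake2s is picked only because it yields a 32-byte digest, and "remove line endings" is interpreted as dropping CRLF and LF sequences. The digests it prints are not expected to match the output of xvc file hash.

```python
import hashlib

# Stand-ins from the standard library, each producing a 32-byte digest.
HASHERS = {
    "sha2": hashlib.sha256,
    "sha3": hashlib.sha3_256,
    "blake2": hashlib.blake2s,  # assumption: any 32-byte BLAKE2 variant
}


def looks_binary(data, probe_size=8192):
    """Treat content as binary if a NUL byte occurs in the first 8 KiB."""
    return b"\x00" in data[:probe_size]


def content_digest(data, algorithm="sha2", text_or_binary="auto"):
    """Return a hex digest, normalizing line endings for text content."""
    if text_or_binary == "auto":
        text_or_binary = "binary" if looks_binary(data) else "text"
    if text_or_binary == "text":
        # Assumed interpretation of "remove line endings".
        data = data.replace(b"\r\n", b"").replace(b"\n", b"")
    elif text_or_binary != "binary":
        raise ValueError(f"unknown mode: {text_or_binary}")
    return HASHERS[algorithm](data).hexdigest()


print(content_digest(b"hello\r\nworld\r\n") == content_digest(b"hello\nworld\n"))
```

In text mode, a file saved with Windows line endings and the same file saved with Unix line endings produce the same digest, which is the point of the normalization.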
xvc file recheck
Synopsis
$ xvc file recheck --help
Get files from cache by copy or *link
Usage: xvc file recheck [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to recheck
Options:
--recheck-method <RECHECK_METHOD>
How to track the file contents in cache: One of copy, symlink, hardlink, reflink.
Note: Reflink support requires "reflink" feature to be enabled and uses copy if the underlying file system doesn't support it.
--no-parallel
Don't use parallelism
--force
Force even if target exists
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
This command has an alias, xvc file checkout, if you feel more at home with Git terminology.
Examples
Rechecking is analogous to git checkout. It copies or links a cached file to the workspace.
Let's create an example directory hierarchy as a showcase.
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 231123
$ tree
.
├── dir-0001
│ ├── file-0001.bin
│ ├── file-0002.bin
│ └── file-0003.bin
└── dir-0002
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
3 directories, 6 files
Start by tracking files.
$ git init
...
$ xvc init
$ xvc file track dir-*
Once you have added the files to the cache, you can delete a workspace copy.
$ rm dir-0001/file-0001.bin
$ lsd -l dir-0001/file-*
total[..]
drwxr-xr-x [..] dir-0001
drwxr-xr-x [..] dir-0002
Then, recheck the file. By default, it makes a copy of the file.
$ xvc file recheck dir-0001/file-0001.bin
$ lsd -l
.rw-rw-rw- [..] data.txt
You can track and recheck complete directories.
$ xvc file track dir-0002/
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ ls -l dir-0002/
total 24
-rw-rw-rw-[..] file-0001.bin
-rw-rw-rw-[..] file-0002.bin
-rw-rw-rw-[..] file-0003.bin
You can use glob patterns to recheck files.
$ xvc file track 'dir-*'
You can update the recheck method of a file. Otherwise, it is kept the same as before.
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/ --as symlink
$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin
$ rm -rf dir-0002/
$ xvc -v file recheck dir-0002/
$ ls -l dir-0002/
total 0
lrwxr-xr-x[..] file-0001.bin -> [CWD]/.xvc/b3/3c9/255/424e13d9c38a37c5ddd376e1070cdd5de66996fbc82194c462f653856d/0.bin
lrwxr-xr-x[..] file-0002.bin -> [CWD]/.xvc/b3/6bc/65f/581e3a03edb127b63b71c5690be176e2fe265266f70abc65f72613f62e/0.bin
lrwxr-xr-x[..] file-0003.bin -> [CWD]/.xvc/b3/804/fb8/edbb122e735facd7f943c1bbe754e939a968f385c12f56b10411a4a015/0.bin
Symlinks and hardlinks are read-only. To update a file, recheck it as a copy.
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
? 1
zsh:1: permission denied: dir-0002/file-0001.bin
$ xvc file recheck dir-0002/file-0001.bin --as copy
$ zsh -c 'echo "120912" >> dir-0002/file-0001.bin'
Note that, because files in the cache are kept read-only, hardlinks and symlinks are also read-only. Files rechecked as copy are made read-write explicitly.
$ xvc -vv file recheck data.txt --as hardlink
$ ls -l
total[..]
drwxr-xr-x[..] dir-0001
drwxr-xr-x[..] dir-0002
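The read-only behaviour of links noted above follows from how permissions work on the file system, and can be observed without Xvc. In this Python sketch the file names and the permission values 0o444 and 0o644 are assumptions chosen for illustration.

```python
import os
import shutil
import stat
import tempfile


def owner_can_write(path):
    """Check the owner write bit; os.stat follows symlinks to the target."""
    return bool(os.stat(path).st_mode & stat.S_IWUSR)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cache-0.bin")  # hypothetical cache file
        with open(cached, "wb") as f:
            f.write(b"payload")
        os.chmod(cached, 0o444)  # the cache entry is kept read-only

        link = os.path.join(tmp, "as-symlink.bin")
        os.symlink(cached, link)  # the link inherits the target's permissions

        copy = os.path.join(tmp, "as-copy.bin")
        shutil.copyfile(cached, copy)  # copies the content only
        os.chmod(copy, 0o644)  # made read-write explicitly

        return owner_can_write(link), owner_can_write(copy)


print(demo())  # (False, True)
```

A symlink or hardlink has no permissions of its own that could differ from the cached file, so only a copy can be made writable without exposing the cache to accidental changes.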
Reflinks are supported by Xvc, but the underlying file system should also support them. Otherwise, Xvc uses copy.
$ rm -f data.txt
$ xvc file recheck data.txt --as reflink
The above command creates a read-only link on macOS APFS and a copy on ext4 or NTFS file systems.
xvc file carry-in
Copies the file changes to cache.
Synopsis
$ xvc file carry-in --help
Carry (commit) changed files to cache
Usage: xvc file carry-in [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to add
Options:
--text-or-binary <TEXT_OR_BINARY>
Calculate digests as text or binary file without checking contents, or by automatically. (Default: auto)
--force
Carry in targets even their content digests are not changed.
This removes the file in cache and re-adds it.
--no-parallel
Don't use parallelism
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
The carry-in command works only in Xvc repositories.
$ git init
...
$ xvc init
We first track a file.
$ xvc file track data.txt
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
We update the file with a command.
$ perl -i -pe 's/a/ee/g' data.txt
$ cat data.txt
Oh, deetee, my, deetee
$ xvc file list data.txt
FC 23 [..] c85f3e81 e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 19
Note that the size of the file has increased, as we replaced each a with ee.
$ xvc file carry-in data.txt
$ xvc file list data.txt
FC 23 [..] e37c686a e37c686a data.txt
Total #: 1 Workspace Size: 23 Cached Size: 23
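The sequence above can be modelled in a few lines of Python. This is a toy model, not Xvc's implementation: the Cache class is hypothetical, SHA-256 stands in for the actual digest, and the original content of data.txt is inferred from the output shown above and its 19-byte size.

```python
import hashlib


def digest(data):
    """Stand-in content digest."""
    return hashlib.sha256(data).hexdigest()


class Cache:
    """Toy content-addressed cache with per-path recorded digests."""

    def __init__(self):
        self.blobs = {}     # digest -> content
        self.recorded = {}  # path -> digest of the last carried-in version

    def track(self, path, data):
        self.blobs[digest(data)] = data
        self.recorded[path] = digest(data)

    def is_changed(self, path, data):
        return self.recorded.get(path) != digest(data)

    def carry_in(self, path, data, force=False):
        """Store the workspace content if it differs from the record."""
        if not force and not self.is_changed(path, data):
            return False
        self.blobs[digest(data)] = data
        self.recorded[path] = digest(data)
        return True


original = b"Oh, data, my, data\n"        # assumed original content, 19 bytes
updated = original.replace(b"a", b"ee")   # same effect as the perl command

cache = Cache()
cache.track("data.txt", original)
print(len(original), len(updated), cache.is_changed("data.txt", updated))
cache.carry_in("data.txt", updated)
print(cache.is_changed("data.txt", updated), len(cache.blobs))
```

After the carry-in, the recorded digest matches the workspace file again, and the cache holds both versions under different addresses.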
xvc file send
Synopsis
$ xvc file send --help
Send (push, upload) files to external storages
Usage: xvc file send [OPTIONS] --storage <STORAGE> [TARGETS]...
Arguments:
[TARGETS]... Targets to send/push/upload to storage
Options:
-s, --storage <STORAGE> Storage name or guid to send the files
--force Force even if the files are already present in the storage
-h, --help Print help
xvc file bring
Synopsis
$ xvc file bring --help
Bring (download, pull, fetch) files from external storages
Usage: xvc file bring [OPTIONS] --storage <STORAGE> [TARGETS]...
Arguments:
[TARGETS]...
Targets to bring from the storage
Options:
-s, --storage <STORAGE>
Storage name or guid to send the files
--force
Force even if the files are already present in the workspace
--no-recheck
Don't recheck (checkout) after bringing the file to cache.
This makes the command similar to `git fetch` in Git. It just updates the cache, and doesn't copy/link the file to workspace.
--recheck-as <RECHECK_AS>
Recheck (checkout) the file in one of the four alternative ways. (See `xvc file recheck`) and [RecheckMethod]
-h, --help
Print help (see a summary with '-h')
xvc file move
Synopsis
$ xvc file move --help
Move files to another location in the workspace
Usage: xvc file move [OPTIONS] <SOURCE> <DESTINATION>
Arguments:
<SOURCE>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If there are multiple source files, the destination must be a directory.
<DESTINATION>
Location we move file(s) to within the workspace.
If this ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.
Options:
--recheck-method <RECHECK_METHOD>
How the destination should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--no-recheck
Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
This command is used to move a set of files to another location in the workspace.
By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.
xvc file move works only with tracked files.
$ git init
...
$ xvc init
$ xvc file track data.txt
$ lsd -l
.rw-rw-rw- [..] data.txt
Once you add the file to the cache, you can move the file to another location.
$ xvc file move data.txt data2.txt
$ ls
data2.txt
Xvc can change the destination file's recheck method.
$ xvc file move data2.txt data3.txt --as symlink
$ ls -l
total[..]
lrwxr-xr-x[..] data3.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can move files without them being in the workspace if they are in the cache.
$ rm -f data3.txt
$ xvc file move data3.txt data4.txt
$ ls -l
total 0
lrwxr-xr-x[..] data4.txt -> [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can use glob patterns to move multiple files. In this case, the destination must be a directory.
$ xvc file copy data4.txt data5.txt
$ xvc file move d*.txt another-set/ --as hardlink
$ xvc file list another-set/
FH [..] c85f3e81 c85f3e81 another-set/data5.txt
FH [..] c85f3e81 c85f3e81 another-set/data4.txt
Total #: 2 Workspace Size: 38 Cached Size: 19
You can also skip rechecking.
In this case, Xvc won't create any copies in the workspace, and you don't need them to be available in the cache.
They will still be listed by the xvc file list command.
$ xvc file move another-set/data5.txt data6.txt --no-recheck
$ xvc file list
XH c85f3e81 data6.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data4.txt
DX 96 [..] another-set
Total #: 3 Workspace Size: 115 Cached Size: 19
Later, you can recheck them in the workspace.
$ xvc file recheck data6.txt
$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt
xvc file copy
Synopsis
$ xvc file copy --help
Copy from source to another location in the workspace
Usage: xvc file copy [OPTIONS] <SOURCE> <DESTINATION>
Arguments:
<SOURCE>
Source file, glob or directory within the workspace.
If the source ends with a slash, it's considered a directory and all files in that directory are copied.
If the number of source files is more than one, the destination must be a directory.
<DESTINATION>
Location we copy file(s) to within the workspace.
If the target ends with a slash, it's considered a directory and created if it doesn't exist.
If the number of source files is more than one, the destination must be a directory.
Options:
--recheck-method <RECHECK_METHOD>
How the targets should be rechecked: One of copy, symlink, hardlink, reflink.
Note: Reflink uses copy if the underlying file system doesn't support it.
--force
Force even if target exists
--no-recheck
Do not recheck the destination files This is useful when you want to copy only records, without updating the workspace
--name-only
When copying multiple files, by default whole path is copied to the destination. This option sets the destination to be created with the file name only
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
This command is used to copy a set of files to another location in the workspace.
By default, it doesn't update the recheck method (cache type) of the targets. It rechecks them to the destination with the same method.
xvc file copy works only with tracked files.
$ git init
...
$ xvc init
$ xvc file track data.txt
$ lsd -l
.rw-rw-rw- [..] data.txt
Once you add the file to the cache, you can copy the file to another location.
$ xvc file copy data.txt data2.txt
$ ls
data.txt
data2.txt
Note that multiple copies of the same content don't add to the cache size.
$ xvc file list data.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ xvc file list 'data*'
FC 19 [..] c85f3e81 c85f3e81 data2.txt
FC 19 [..] c85f3e81 c85f3e81 data.txt
Total #: 2 Workspace Size: 38 Cached Size: 19
Xvc can change the destination file's recheck method.
$ xvc file copy data.txt data3.txt --as symlink
$ lsd -l
.rw-rw-rw- [..] data.txt
.rw-rw-rw- [..] data2.txt
lrwxr-xr-x [..] data3.txt ⇒ [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
You can create views of your data by copying it to another location.
$ xvc file copy 'd*' another-set/ --as hardlink
$ xvc file list another-set/
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
Total #: 3 Workspace Size: 57 Cached Size: 19
If the source files you specify are changed, Xvc cancels the copy operation. Please either recheck old versions or carry in new versions.
$ perl -i -pe 's/a/ee/g' data.txt
$ xvc file copy data.txt data5.txt
? 1
[ERROR] File Error: Sources have changed, please carry-in or recheck following files before copying:
data.txt
Error: FileError { source: AnyhowError { source: Sources have changed, please carry-in or recheck following files before copying:
data.txt } }
You can copy files without them being in the workspace if they are in the cache.
$ rm -f data.txt
$ xvc file copy data.txt data6.txt
$ lsd -l data6.txt
.rw-rw-rw- [..] data6.txt
You can also skip rechecking.
In this case, Xvc won't create any copies in the workspace, and you don't need them to be available in the cache.
They will still be listed by the xvc file list command.
$ xvc file copy data.txt data7.txt --no-recheck
$ ls
another-set
data2.txt
data3.txt
data6.txt
$ xvc file list
XC [..] c85f3e81 data7.txt
FC 19 [..] c85f3e81 c85f3e81 data6.txt
SS [..] [..] c85f3e81 data3.txt
FC 19 [..] c85f3e81 c85f3e81 data2.txt
XC [..] c85f3e81 data.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data3.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data2.txt
FH 19 [..] c85f3e81 c85f3e81 another-set/data.txt
DX 160 [..] another-set
Total #: 9 Workspace Size: [..] Cached Size: 19
Later, you can recheck them to work in the workspace.
$ xvc file recheck data7.txt
$ lsd -l data7.txt
.rw-rw-rw- [..] data7.txt
xvc file remove
Synopsis
$ xvc file remove --help
Remove files from Xvc and possibly storages
Usage: xvc file remove [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Files/directories to remove
Options:
--from-cache
Remove files from cache
--from-storage <FROM_STORAGE>
Remove files from storage
--all-versions
Remove all versions of the file
--only-version <ONLY_VERSION>
Remove only the specified version of the file
Versions are specified with the content hash 123-456-789abcd. Dashes are optional. Prefix must be unique. If the prefix is not unique, the command will fail.
--force
Remove the targets even if they are used by other targets (via deduplication)
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Examples
This command deletes files from the Xvc cache or storage. It doesn't remove the file from Xvc tracking.
If you want to remove a workspace file or link, you can use the usual rm command. If the file is tracked and carried in to the cache, you can always recheck it.
This command only works if the file is tracked by Xvc.
$ git init
...
$ xvc init
$ xvc file track 'd*.txt'
$ xvc file list
FC [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ tree .xvc/b3/
.xvc/b3/
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
4 directories, 1 file
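The tree output above suggests how a cache path relates to the content digest: the 64 hexadecimal digits are split into groups of 3, 3 and 58 under a b3 directory, and the content is stored as 0 plus the original extension. The following Python sketch reproduces this layout as observed in the listings of this document. The cache_path function is hypothetical, and the meaning of the 0 in the file name is not covered here, so it is treated as a literal.

```python
from pathlib import PurePosixPath


def cache_path(hex_digest, extension, algorithm_dir="b3"):
    """Build the cache path layout seen in the tree listings."""
    if len(hex_digest) != 64:
        raise ValueError("expected a 64-digit hexadecimal digest")
    return PurePosixPath(
        ".xvc",
        algorithm_dir,
        hex_digest[:3],
        hex_digest[3:6],
        hex_digest[6:],
        f"0.{extension}",
    )


# Digest of data.txt, reassembled from the listing above.
data_txt = "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496"
print(cache_path(data_txt, "txt"))
```

The first eight digits of this digest, c85f3e81, are what xvc file list shows in its digest columns.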
If you specify neither --from-cache nor --from-storage, the command fails with an error.
$ xvc file remove data.txt
? failed
error: the following required arguments were not provided:
--from-cache
--from-storage <FROM_STORAGE>
Usage: xvc file remove --from-cache --from-storage <FROM_STORAGE> <TARGETS>...
For more information, try '--help'.
You can remove the file from the cache. The file is still tracked by Xvc and available in the workspace.
$ xvc file remove --from-cache data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
[DELETE] [CWD]/.xvc/b3
$ ls
data.txt
$ ls .xvc/
config.local.toml
config.toml
ec
store
You can carry the missing file from the workspace to the cache. Use --force to overwrite the cache, as carry-in doesn't overwrite the cache by default.
$ xvc file carry-in --force data.txt
$ xvc file list
FC [..] c85f3e81 c85f3e81 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ tree .xvc/b3/
.xvc/b3/
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
4 directories, 1 file
You can specify a version of a file to delete from the cache. Versions can be specified like 123-456-789abcd. Dashes are optional. The prefix must be unique.
$ perl -pi -e 's/a/e/g' data.txt
$ xvc file carry-in data.txt
$ tree .xvc/b3/
.xvc/b3/
├── 660
│ └── 2cf
│ └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│ └── 0.txt
└── c85
└── f3e
└── 8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
└── 0.txt
7 directories, 2 files
$ xvc file list
FC [..] 6602cff6 6602cff6 data.txt
Total #: 1 Workspace Size: 19 Cached Size: 19
$ xvc file remove --from-cache --only-version c85-f3e data.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496/0.txt
[DELETE] [CWD]/.xvc/b3/c85/f3e/8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496
[DELETE] [CWD]/.xvc/b3/c85/f3e
[DELETE] [CWD]/.xvc/b3/c85
$ tree .xvc/b3/
.xvc/b3/
└── 660
└── 2cf
└── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
└── 0.txt
4 directories, 1 file
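The version matching rules stated above (optional dashes, unique prefix) can be sketched in Python. The resolve_version function is hypothetical and only illustrates the rules. The two digests are the versions of data.txt that were in the cache before the removal.

```python
def resolve_version(prefix, digests):
    """Return the single digest that starts with the given prefix."""
    wanted = prefix.replace("-", "").lower()  # dashes are optional
    if not wanted:
        raise ValueError("empty version prefix")
    matches = [d for d in digests if d.startswith(wanted)]
    if len(matches) != 1:
        raise ValueError(f"prefix {prefix!r} matches {len(matches)} versions")
    return matches[0]


versions = [
    "c85f3e8108a0d53da6b4869e5532a3b72301ed58d5824ed1394d52dbcabe9496",
    "6602cff6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367",
]
print(resolve_version("c85-f3e", versions)[:8])  # c85f3e81
```

A prefix that matches no version or more than one version raises an error in this sketch, mirroring the rule that the command fails when the prefix is not unique.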
You can also remove all versions of a file from the cache.
$ xvc-test-helper generate-random-file --seed 0 data.txt
$ xvc file carry-in data.txt
$ rm data.txt
$ xvc-test-helper generate-random-file --seed 1 data.txt
$ xvc file carry-in data.txt
$ tree .xvc/b3/
.xvc/b3/
├── 017
│ └── ad8
│ └── 6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0
│ └── 0.txt
├── 660
│ └── 2cf
│ └── f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
│ └── 0.txt
└── fef
└── e16
└── d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152
└── 0.txt
10 directories, 3 files
$ xvc file remove --from-cache --all-versions data.txt
[DELETE] [CWD]/.xvc/b3/017/ad8/6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0/0.txt
[DELETE] [CWD]/.xvc/b3/017/ad8/6d31011a7f6c8eabd808ba4f8cf3d3c0c65322ded3fffdfcb8d60279a0
[DELETE] [CWD]/.xvc/b3/017/ad8
[DELETE] [CWD]/.xvc/b3/017
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367/0.txt
[DELETE] [CWD]/.xvc/b3/660/2cf/f6a4cbc23a78205463b7086d1b0831d3d74c063122f20c1c2ea0c2d367
[DELETE] [CWD]/.xvc/b3/660/2cf
[DELETE] [CWD]/.xvc/b3/660
[DELETE] [CWD]/.xvc/b3/fef/e16/d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152/0.txt
[DELETE] [CWD]/.xvc/b3/fef/e16/d9668f4c96ee7e719517f056aa23653fe9aaeddc9bfe81324fff534152
[DELETE] [CWD]/.xvc/b3/fef/e16
[DELETE] [CWD]/.xvc/b3/fef
[DELETE] [CWD]/.xvc/b3
$ ls .xvc/
config.local.toml
config.toml
ec
store
You can use this command to remove cached files from (remote) storages as well.
$ xvc-test-helper generate-random-file --seed 2 data.txt
$ xvc file carry-in data.txt
$ xvc storage new local --name local-storage --path '../local-storage'
$ xvc file send data.txt --to local-storage
$ tree ../local-storage/
../local-storage/
└── [..]
└── b3
└── 218
└── 2b7
└── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
└── 0.txt
6 directories, 1 file
$ xvc file remove data.txt --from-storage local-storage
$ tree ../local-storage/
../local-storage/
└── [..]
└── b3
└── 218
└── 2b7
└── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
6 directories, 0 files
Note that storage delete implementations differ slightly in that they don't remove the directories. This avoids unnecessary round-trip existence checks.
If multiple paths point to the same cache file (deduplication), the cache file is not deleted.
In this case, remove reports the other paths that point to the same cache file. You must use --force to delete the cache file.
$ xvc-test-helper generate-random-file --seed 3 data.txt
$ xvc file carry-in data.txt
$ xvc file copy data.txt data2.txt --as symlink
$ xvc file list
SS [..] [..] 4a2e9d7c data2.txt
FC 1024 [..] 4a2e9d7c 4a2e9d7c data.txt
Total #: 2 Workspace Size: [..] Cached Size: 1024
$ xvc file remove --from-cache data.txt
Not deleting b3/4a2/e9d/7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee/0.txt (for data.txt) because it's also used by data2.txt
$ tree .xvc/b3/
.xvc/b3/
├── 218
│ └── 2b7
│ └── 7f5a61c7a82b34da4c754cce1fe6834fc3f07b3f7c7e0920d1add59881
│ └── 0.txt
└── 4a2
└── e9d
└── 7c40d2cf892c41351a2465b54b85f62a0052e25a63950c8ab4ac48b2ee
└── 0.txt
7 directories, 2 files
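The check that protects shared cache files can be thought of as a lookup of other paths recorded with the same digest. The following Python sketch illustrates the idea. The plan_removal function and the path to digest mapping are hypothetical, and the digest is shortened to the eight digits shown in the listing.

```python
def plan_removal(targets, recorded, force=False):
    """Split the targets' digests into those to delete and those kept.

    recorded maps each tracked path to its content digest. A digest is kept
    when a path outside the targets still uses it, unless force is set.
    """
    targets = set(targets)
    to_delete, kept = set(), {}
    for path in sorted(targets):
        digest = recorded[path]
        others = sorted(p for p, d in recorded.items()
                        if d == digest and p not in targets)
        if others and not force:
            kept[digest] = others  # report who else uses this cache file
        else:
            to_delete.add(digest)
    return to_delete, kept


recorded = {"data.txt": "4a2e9d7c", "data2.txt": "4a2e9d7c"}
print(plan_removal(["data.txt"], recorded))
print(plan_removal(["data.txt"], recorded, force=True))
```

Without force, the digest used by data2.txt is kept and reported. With force, it is scheduled for deletion even though data2.txt would then point to a missing cache file.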
Data-Model Pipelines
Synopsis
$ xvc pipeline --help
Pipeline management commands
Usage: xvc pipeline [OPTIONS] <COMMAND>
Commands:
new Create a new pipeline
update Update the name and other attributes of a pipeline
delete Delete a pipeline
run Run a pipeline
list List all pipelines
dag Generate a dot or mermaid diagram for the pipeline
export Export the pipeline to a YAML or JSON file to edit
import Import the pipeline from a file
step Step creation, dependency, output commands
help Print this message or the help of the given subcommand(s)
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-h, --help Print help
xvc pipeline new
Synopsis
$ xvc pipeline new --help
Create a new pipeline
Usage: xvc pipeline new [OPTIONS] --pipeline-name <PIPELINE_NAME>
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
-w, --workdir <WORKDIR> Default working directory
-h, --help Print help
Examples
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You can create a new pipeline with a name.
$ xvc pipeline new --pipeline-name my-pipeline
By default it will run the commands in the repository root.
$ xvc pipeline list
+-------------+---------+
| Name        | Run Dir |
+=======================+
| default     |         |
|-------------+---------|
| my-pipeline |         |
+-------------+---------+
If you want to define a pipeline specific to a directory, you can set the working directory.
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230215
$ xvc pipeline new --pipeline-name another-pipeline --workdir dir-0001
The pipeline will run the commands in the specified directory.
$ xvc pipeline list
+------------------+----------+
| Name             | Run Dir  |
+=============================+
| default          |          |
|------------------+----------|
| my-pipeline      |          |
|------------------+----------|
| another-pipeline | dir-0001 |
+------------------+----------+
xvc pipeline list
Synopsis
$ xvc pipeline list --help
List all pipelines
Usage: xvc pipeline list
Options:
-h, --help Print help
Examples
Please see xvc pipeline new for examples.
xvc pipeline step
Synopsis
$ xvc pipeline step --help
Step creation, dependency, output commands
Usage: xvc pipeline step <COMMAND>
Commands:
list List steps in a pipeline
new Add a new step
remove Remove a step from a pipeline
update Update step options
dependency Add a dependency to a step
output Add an output to a step
show Print step configuration
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc pipeline step new
Purpose
Create a new step in the pipeline.
Synopsis
$ xvc pipeline step new --help
Add a new step
Usage: xvc pipeline step new [OPTIONS] --step-name <STEP_NAME> --command <COMMAND>
Options:
-s, --step-name <STEP_NAME> Name of the new step
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
Examples
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You can create a new step with a name and a command.
$ xvc pipeline step new --step-name hello --command "echo hello"
By default, a step runs only if its dependencies have changed (--when by_dependencies).
If you want to run the command always, regardless of changes in dependencies, you can set --when to always.
$ xvc pipeline step new --step-name world --command "echo world" --when always
If you want a step to never run, you can set --when to never.
$ xvc pipeline step new --step-name never --command "echo never" --when never
You can update when the step will run with xvc pipeline step update.
You can get the list of steps in the pipeline with export or dag.
$ xvc pipeline export
{
  "name": "default",
  "steps": [
    {
      "command": "echo hello",
      "dependencies": [],
      "invalidate": "ByDependencies",
      "name": "hello",
      "outputs": []
    },
    {
      "command": "echo world",
      "dependencies": [],
      "invalidate": "Always",
      "name": "world",
      "outputs": []
    },
    {
      "command": "echo never",
      "dependencies": [],
      "invalidate": "Never",
      "name": "never",
      "outputs": []
    }
  ],
  "version": 1,
  "workdir": ""
}
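The three --when values can be read as a small decision rule. The Python sketch below only illustrates the behavior described in this section; should_run is a hypothetical name and the function is not how Xvc implements it.

```python
def should_run(when: str, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, given its --when setting.

    Illustrative only: mirrors the documented meaning of
    always, never and by_dependencies.
    """
    if when == "always":
        return True
    if when == "never":
        return False
    if when == "by_dependencies":
        return dependencies_changed
    raise ValueError(f"unknown --when value: {when!r}")


for when in ("always", "never", "by_dependencies"):
    print(when, should_run(when, dependencies_changed=False))
```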
xvc pipeline step list
Purpose
List the steps and their commands in a pipeline
Synopsis
$ xvc pipeline step list --help
List steps in a pipeline
Usage: xvc pipeline step list [OPTIONS]
Options:
--names-only Show only the names, otherwise print commands as well
-h, --help Print help
Examples
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You may want to list the steps of a pipeline and their commands.
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step new --step-name world --command "echo world" --when always
$ xvc pipeline step list
hello: echo hello (by_dependencies)
world: echo world (always)
By default, it lists the commands and when they will run (always, never, by_dependencies). If you only need the names of the steps, you can use the --names-only flag.
$ xvc pipeline step list --names-only
hello
world
xvc pipeline step dependency
Purpose
Define a dependency to an existing step in the pipeline.
Synopsis
$ xvc pipeline step dependency --help
Add a dependency to a step
Usage: xvc pipeline step dependency [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME>
Name of the step to add the dependency to
--generic <GENERICS>
Add a generic command output as a dependency. Can be used multiple times. Please delimit the command with ' ' to avoid shell expansion
--url <URLS>
Add a URL dependency to the step. Can be used multiple times
--file <FILES>
Add a file dependency to the step. Can be used multiple times
--step <STEPS>
Add a step dependency to a step. Can be used multiple times. Steps are referred with their names
--glob_items <GLOB_ITEMS>
Add a glob items dependency to the step.
You can depend on multiple files and directories with this dependency.
The difference between this and the glob option is that this option keeps track of all matching files, but glob only keeps track of the matched files' digest. When you want to use ${XVC_GLOB_ITEMS}, ${XVC_ADDED_GLOB_ITEMS}, or ${XVC_REMOVED_GLOB_ITEMS} environment variables in the step command, use the glob-items dependency. Otherwise, you can use the glob option to save disk space.
--glob <GLOBS>
Add a glob dependency to the step. Can be used multiple times.
You can depend on multiple files and directories with this dependency.
The difference between this and the glob-items option is that the glob-items option keeps track of all matching files individually, but this option only keeps track of the matched files' digest. This dependency uses considerably less disk space.
--param <PARAMS>
Add a parameter dependency to the step in the form filename.yaml::model.units
The file can be a JSON, TOML, or YAML file. You can specify hierarchical keys like my.dict.key
--regex_items <REGEX_ITEMS>
Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
The difference between this and the regex option is that the regex-items option keeps track of all matching lines, but regex only keeps track of the matched lines' digest. When you want to use ${XVC_REGEX_ITEMS}, ${XVC_ADDED_REGEX_ITEMS}, ${XVC_REMOVED_REGEX_ITEMS} environment variables in the step command, use the regex option. Otherwise, you can use the regex-digest option to save disk space.
--regex <REGEXES>
Add a regex dependency in the form filename.txt:/^regex/ . Can be used multiple times.
The difference between this and the regex option is that the regex option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest.
--line_items <LINE_ITEMS>
Add a line dependency in the form filename.txt::123-234
The difference between this and the lines option is that the line-items option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest. When you want to use ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, ${XVC_CHANGED_LINE_ITEMS} options in the step command, use the line option. Otherwise, you can use the lines option to save disk space.
--lines <LINES>
Add a line digest dependency in the form filename.txt::123-234
The difference between this and the line-items dependency is that the line option keeps track of all matching lines that can be used in the step command. This option only keeps track of the matched lines' digest. If you don't need individual lines to be kept, use this option to save space.
--sqlite-query <SQLITE_FILE> <SQLITE_QUERY>
Add a sqlite query dependency to the step with the file and the query. Can be used once.
The step is invalidated when the query run and the result is different from previous runs, e.g. when an aggregate changed or a new row added to a table.
-h, --help
Print help (see a summary with '-h')
File Dependencies
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Begin by adding a new step.
$ xvc pipeline step new --step-name file-dependency --command "echo data.txt has changed"
Add a file dependency to the step.
$ xvc pipeline step dependency --step-name file-dependency --file data.txt
When you run the pipeline, it will print data.txt has changed if the file data.txt has changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
You can add multiple dependencies to a step with multiple invocations.
$ xvc pipeline step dependency --step-name file-dependency --file data2.txt
A step will run if any of its dependencies have changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
By default, steps are not run if none of their dependencies have changed.
$ xvc pipeline run
However, if you want to run the step even if none of the dependencies have changed, you can set the --when option to always.
$ xvc pipeline step update --step-name file-dependency --when always
Now the step will run even if none of the dependencies have changed.
$ xvc pipeline run
[OUT] [file-dependency] data.txt has changed
[DONE] [file-dependency] (echo data.txt has changed)
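File dependencies are checked by content rather than by timestamp. The following Python sketch shows the idea under simplifying assumptions: SHA-256 stands in for whatever digest Xvc actually uses, and content_digest and is_invalidated are hypothetical helper names.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Hex digest of the file content (SHA-256 as a stand-in)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_invalidated(path: Path, recorded_digest: str) -> bool:
    """A dependency is invalidated only when its content digest differs."""
    return content_digest(path) != recorded_digest


with tempfile.TemporaryDirectory() as workdir:
    data = Path(workdir) / "data.txt"
    data.write_text("first version\n")
    recorded = content_digest(data)

    # Changing only the timestamp leaves the digest untouched.
    os.utime(data, (0, 0))
    touched_only = is_invalidated(data, recorded)

    # Changing the content produces a different digest.
    data.write_text("second version\n")
    content_edited = is_invalidated(data, recorded)

print(touched_only, content_edited)
```

A timestamp-based tool would rebuild after the os.utime call; a content-based check does not.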
Glob Dependencies
A step can depend on multiple files specified with globs. Unlike the glob-items dependency, this one doesn't track the individual files and doesn't pass the list of files to the command in environment variables.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Let's create a set of files:
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023
$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin
3 directories, 6 files
Add a step that reports when the files have changed.
$ xvc pipeline step new --step-name files-changed --command "echo 'Files have changed.'"
$ xvc pipeline step dependency --step-name files-changed --glob 'dir-*/*'
The step is invalidated when a file described by the glob is added, removed or changed.
$ xvc pipeline run
[OUT] [files-changed] Files have changed.
[DONE] [files-changed] (echo 'Files have changed.')
$ xvc pipeline run
When a file is removed from the files described by the glob, the step is invalidated.
$ rm dir-0001/file-0001.bin
$ xvc pipeline run
[OUT] [files-changed] Files have changed.
[DONE] [files-changed] (echo 'Files have changed.')
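One way to picture a glob dependency is a single digest computed over every matching file, so that adding, removing or changing any file yields a different value. The sketch below is an assumption about the general technique, not Xvc's implementation; glob_digest is a hypothetical helper and SHA-256 is a stand-in digest.

```python
import hashlib
import tempfile
from pathlib import Path


def glob_digest(root: Path, pattern: str) -> str:
    """Fold the paths and contents of all matching files into one digest."""
    combined = hashlib.sha256()
    for path in sorted(p for p in root.glob(pattern) if p.is_file()):
        combined.update(path.relative_to(root).as_posix().encode())
        combined.update(b"\0")
        combined.update(hashlib.sha256(path.read_bytes()).digest())
    return combined.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    root = Path(workdir)
    for directory in ("dir-0001", "dir-0002"):
        (root / directory).mkdir()
        for index in (1, 2, 3):
            (root / directory / f"file-000{index}.bin").write_bytes(bytes([index]))

    before = glob_digest(root, "dir-*/*")
    unchanged = glob_digest(root, "dir-*/*")

    # Removing one matching file changes the combined digest.
    (root / "dir-0001" / "file-0001.bin").unlink()
    after_removal = glob_digest(root, "dir-*/*")

print(before == unchanged, before == after_removal)
```

Only one digest has to be stored, which is consistent with the help text's note that this dependency uses less space than glob-items.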
Regex Dependencies
You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Now, let's add a step to the pipeline to count females in the file:
$ xvc pipeline step new --step-name count-females --command "grep -c '\"F\",' people.csv"
These commands are run when the regex dependencies change.
$ xvc pipeline step dependency --step-name count-females --regex 'people.csv:/^.*"F",.*$'
When you run the pipeline initially, the steps are run.
$ xvc pipeline run
[OUT] [count-females] 7
[DONE] [count-females] (grep -c '"F",' people.csv)
When you run the pipeline again, the step is not run because the regex result didn't change.
$ xvc pipeline run
When you add a new female record to the file, the step is run and the command prints the new count.
$ zsh -c "echo '\"Asude\", \"F\", 12, 55, 110' >> people.csv"
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
"Asude", "F", 12, 55, 110
$ xvc pipeline run
[OUT] [count-females] 8
[DONE] [count-females] (grep -c '"F",' people.csv)
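The behavior above can be sketched as a digest over the lines that match the regular expression: lines that don't match have no effect on it. This is an illustrative Python sketch, not Xvc's code; regex_digest is a hypothetical helper and SHA-256 is a stand-in digest.

```python
import hashlib
import re


def regex_digest(text: str, pattern: str) -> str:
    """Digest of the lines that match the pattern, in file order."""
    matched = [line for line in text.splitlines() if re.search(pattern, line)]
    return hashlib.sha256("\n".join(matched).encode()).hexdigest()


pattern = r'^.*"F",.*$'
people = '"Alex", "M", 41, 74, 170\n"Elly", "F", 30, 66, 124\n'
recorded = regex_digest(people, pattern)

# A new non-matching line leaves the digest as it was.
with_new_male = people + '"Bert", "M", 42, 68, 166\n'
# A new matching line changes it, which would invalidate the step.
with_new_female = people + '"Asude", "F", 12, 55, 110\n'

print(regex_digest(with_new_male, pattern) == recorded)
print(regex_digest(with_new_female, pattern) == recorded)
```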
Line Dependencies
You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.
When the text in those lines changes, the step is invalidated.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Let's add a step to show the first 10 lines of the file:
$ xvc pipeline step new --step-name print-top-10 --command "head people.csv"
The command is run only when those lines change.
$ xvc pipeline step dependency --step-name print-top-10 --lines 'people.csv::1-10'
When you run the pipeline initially, the step is run.
$ xvc pipeline run
[OUT] [print-top-10] "Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
[DONE] [print-top-10] (head people.csv)
When you run the pipeline again, the step is not run because the specified lines didn't change.
$ xvc pipeline run
When you change one of those lines in the file, the step is invalidated.
$ perl -i -pe 's/Hank/Ferzan/g' people.csv
Now, when you run the pipeline, it will print the first 10 lines again.
$ xvc pipeline run
[OUT] [print-top-10] "Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Ferzan", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
[DONE] [print-top-10] (head people.csv)
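A line dependency such as people.csv::1-10 can be thought of as a file name plus a line range whose text is hashed. The Python sketch below makes two assumptions that this page does not confirm: the range is treated as 1-based and inclusive, and SHA-256 stands in for the real digest. parse_line_spec and lines_digest are hypothetical helpers.

```python
import hashlib


def parse_line_spec(spec: str) -> tuple[str, int, int]:
    """Split 'filename::begin-end' into its three parts."""
    filename, line_range = spec.split("::", 1)
    begin, end = line_range.split("-", 1)
    return filename, int(begin), int(end)


def lines_digest(text: str, begin: int, end: int) -> str:
    """Digest of lines begin..end (assumed 1-based and inclusive)."""
    selected = text.splitlines()[begin - 1:end]
    return hashlib.sha256("\n".join(selected).encode()).hexdigest()


filename, begin, end = parse_line_spec("people.csv::1-10")
original = "\n".join(f"row {number}" for number in range(1, 19))

# An edit inside the range changes the digest; one outside does not.
edited_inside = original.replace("row 9\n", "row nine\n")
edited_outside = original.replace("row 15\n", "row fifteen\n")

recorded = lines_digest(original, begin, end)
print(lines_digest(edited_inside, begin, end) == recorded)
print(lines_digest(edited_outside, begin, end) == recorded)
```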
Glob Items Dependency
A step can depend on multiple files specified with globs. When any of the files change, or a new file is added or removed from the files specified by glob, the step is invalidated.
Unlike the glob dependency, the glob items dependency keeps track of the individual files that belong to a glob. If your command runs with the list of files from a glob and you want to track added and removed files, use this. Otherwise, if your command runs for all the files in a glob and you don't need to track which files have changed, use the glob dependency.
This one injects ${XVC_ADDED_GLOB_ITEMS}, ${XVC_REMOVED_GLOB_ITEMS}, ${XVC_CHANGED_GLOB_ITEMS} and ${XVC_ALL_GLOB_ITEMS} into the command environment.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Let's create a set of files:
$ xvc-test-helper create-directory-tree --directories 2 --files 3 --seed 2023
$ tree
.
├── dir-0001
│   ├── file-0001.bin
│   ├── file-0002.bin
│   └── file-0003.bin
└── dir-0002
    ├── file-0001.bin
    ├── file-0002.bin
    └── file-0003.bin
3 directories, 6 files
Add a step to list the added, removed and changed files.
$ xvc pipeline step new --step-name files-changed --command 'echo "### Added Files:\n${XVC_ADDED_GLOB_ITEMS}\n### Removed Files:\n${XVC_REMOVED_GLOB_ITEMS}\n### Changed Files:\n${XVC_CHANGED_GLOB_ITEMS}"'
$ xvc pipeline step dependency --step-name files-changed --glob-items 'dir-*/*'
The step is invalidated when a file described by the glob is added, removed or changed.
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
dir-0001/file-0001.bin
dir-0001/file-0002.bin
dir-0001/file-0003.bin
dir-0002/file-0001.bin
dir-0002/file-0002.bin
dir-0002/file-0003.bin
### Removed Files:
### Changed Files:
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
$ xvc pipeline run
If you add or remove a file from the files specified by the glob, they are printed.
$ rm dir-0001/file-0001.bin
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
### Removed Files:
dir-0001/file-0001.bin
### Changed Files:
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
When you change a file, it's printed in the changed files:
$ xvc-test-helper generate-filled-file dir-0001/file-0002.bin
$ xvc pipeline run
[OUT] [files-changed] ### Added Files:
### Removed Files:
### Changed Files:
dir-0001/file-0002.bin
[DONE] [files-changed] (echo "### Added Files:/n${XVC_ADDED_GLOB_ITEMS}/n### Removed Files:/n${XVC_REMOVED_GLOB_ITEMS}/n### Changed Files:/n${XVC_CHANGED_GLOB_ITEMS}")
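The added, removed and changed lists can be derived by comparing two path-to-digest maps, one recorded at the previous run and one computed now. The following Python sketch illustrates that comparison; diff_items is a hypothetical helper, the digests are placeholder strings, and joining paths with newlines is inferred from the output above.

```python
def diff_items(previous: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Compare two path -> digest maps and return newline-joined path lists."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        path for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return {
        "XVC_ADDED_GLOB_ITEMS": "\n".join(added),
        "XVC_REMOVED_GLOB_ITEMS": "\n".join(removed),
        "XVC_CHANGED_GLOB_ITEMS": "\n".join(changed),
    }


# Placeholder digests; real ones would be content digests of the files.
previous = {
    "dir-0001/file-0001.bin": "aaa",
    "dir-0001/file-0002.bin": "bbb",
    "dir-0001/file-0003.bin": "ccc",
}
current = {
    "dir-0001/file-0002.bin": "bbb-modified",
    "dir-0001/file-0003.bin": "ccc",
}

environment = diff_items(previous, current)
for name, value in environment.items():
    print(f"{name}={value!r}")
```

Keeping one digest per file is what makes this dependency larger on disk than the plain glob dependency.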
Regex Item Dependencies
You can specify a regular expression matched against the lines of a file as a dependency. The step is invalidated when the matched results change.
Unlike regex dependencies, regex item dependencies keep track of the matched items. You can access them with the ${XVC_ALL_REGEX_ITEMS}, ${XVC_ADDED_REGEX_ITEMS}, and ${XVC_REMOVED_REGEX_ITEMS} environment variables.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Now, let's add steps to the pipeline to print newly added male and female records in the file:
$ xvc pipeline step new --step-name new-males --command 'echo "New Males:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step new --step-name new-females --command 'echo "New Females:\n ${XVC_ADDED_REGEX_ITEMS}"'
$ xvc pipeline step dependency --step-name new-females --step new-males
We also added a step dependency so that the steps always run in the same order.
These commands are run when the following regexes change.
$ xvc pipeline step dependency --step-name new-males --regex-items 'people.csv:/^.*"M",.*$'
$ xvc pipeline step dependency --step-name new-females --regex-items 'people.csv:/^.*"F",.*$'
When you run the pipeline initially, the steps are run.
$ xvc pipeline run
[OUT] [new-males] New Males:
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Luke", "M", 34, 72, 163
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Quin", "M", 29, 71, 176
[DONE] [new-males] (echo "New Males:/n ${XVC_ADDED_REGEX_ITEMS}")
[OUT] [new-females] New Females:
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Kate", "F", 47, 69, 139
"Myra", "F", 23, 62, 98
"Page", "F", 31, 67, 135
"Ruth", "F", 28, 65, 131
[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")
When you run the pipeline again, the steps are not run because the regexes didn't change.
$ xvc pipeline run
When you add a new female record to the file, only the new-females step is run.
$ zsh -c "echo '\"Asude\", \"F\", 12, 55, 110' >> people.csv"
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
"Asude", "F", 12, 55, 110
$ xvc pipeline run
[OUT] [new-females] New Females:
"Asude", "F", 12, 55, 110
[DONE] [new-females] (echo "New Females:/n ${XVC_ADDED_REGEX_ITEMS}")
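Because the matched lines themselves are kept, the added and removed items can be computed by comparing the previous matches with the current ones. The Python sketch below illustrates this; matching_lines and added_and_removed are hypothetical helpers and the comparison strategy is an assumption, not taken from the Xvc source.

```python
import re


def matching_lines(text: str, pattern: str) -> list[str]:
    """Lines of the text that match the pattern, in file order."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


def added_and_removed(previous: list[str], current: list[str]) -> tuple[list[str], list[str]]:
    """Items present only in the current run, and items present only in the previous run."""
    added = [line for line in current if line not in previous]
    removed = [line for line in previous if line not in current]
    return added, removed


pattern = r'^.*"F",.*$'
before = '"Alex", "M", 41, 74, 170\n"Elly", "F", 30, 66, 124\n'
after = before + '"Asude", "F", 12, 55, 110\n'

added, removed = added_and_removed(
    matching_lines(before, pattern),
    matching_lines(after, pattern),
)
print("added:", added)
print("removed:", removed)
```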
Line Item Dependencies
You can make your steps depend on lines of text files. The lines are defined by starting and ending indices.
When the text in those lines changes, the step is invalidated.
Unlike line dependencies, this dependency type keeps track of the lines in the file. You can use the ${XVC_ALL_LINE_ITEMS}, ${XVC_ADDED_LINE_ITEMS}, and ${XVC_REMOVED_LINE_ITEMS} environment variables in the command. Please be aware that for a large set of lines, this dependency can take up considerable space to keep track of all lines. If you don't need to keep track of changed lines, you can use the --lines dependency.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
We'll use a sample CSV file in this example:
$ cat people.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
"Jake", "M", 32, 69, 143
"Kate", "F", 47, 69, 139
"Luke", "M", 34, 72, 163
"Myra", "F", 23, 62, 98
"Neil", "M", 36, 75, 160
"Omar", "M", 38, 70, 145
"Page", "F", 31, 67, 135
"Quin", "M", 29, 71, 176
"Ruth", "F", 28, 65, 131
Let's add a step to show the added and removed lines among the first 10 lines of the file:
$ xvc pipeline step new --step-name print-top-10 --command 'echo "Added Lines:\n ${XVC_ADDED_LINE_ITEMS}\nRemoved Lines:\n${XVC_REMOVED_LINE_ITEMS}"'
The command is run only when those lines change.
$ xvc pipeline step dependency --step-name print-top-10 --line-items 'people.csv::1-10'
When you run the pipeline initially, the step is run.
$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
"Carl", "M", 32, 70, 155
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 66, 124
"Fran", "F", 33, 66, 115
"Gwen", "F", 26, 64, 121
"Hank", "M", 30, 71, 158
"Ivan", "M", 53, 72, 175
Removed Lines:
[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")
When you run the pipeline again, the step is not run because the specified lines didn't change.
$ xvc pipeline run
When you change one of those lines in the file, the step is invalidated.
$ perl -i -pe 's/Hank/Ferzan/g' people.csv
Now, when you run the pipeline, it will print the changed line, with its new and old versions.
$ xvc pipeline run
[OUT] [print-top-10] Added Lines:
"Ferzan", "M", 30, 71, 158
Removed Lines:
"Hank", "M", 30, 71, 158
[DONE] [print-top-10] (echo "Added Lines:/n ${XVC_ADDED_LINE_ITEMS}/nRemoved Lines:/n${XVC_REMOVED_LINE_ITEMS}")
SQLite Query Dependency
You can create a step dependency with an SQLite query. When the query results change, the step is invalidated.
SQLite dependencies don't keep the results of the query. They only check whether the query results have changed.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Suppose we have an SQLite database people.db with the following schema and data:
CREATE TABLE People (
Name TEXT,
Sex TEXT,
Age INTEGER,
Height_in INTEGER,
Weight_lbs INTEGER
);
INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES
('Alex', 'M', 41, 74, 170),
('Bert', 'M', 42, 68, 166),
('Carl', 'M', 32, 70, 155),
('Dave', 'M', 39, 72, 167),
('Elly', 'F', 30, 66, 124),
('Fran', 'F', 33, 66, 115),
('Gwen', 'F', 26, 64, 121),
('Hank', 'M', 30, 71, 158),
('Ivan', 'M', 53, 72, 175),
('Jake', 'M', 32, 69, 143),
('Kate', 'F', 47, 69, 139),
('Luke', 'M', 34, 72, 163),
('Myra', 'F', 23, 62, 98),
('Neil', 'M', 36, 75, 160),
('Omar', 'M', 38, 70, 145),
('Page', 'F', 31, 67, 135),
('Quin', 'M', 29, 71, 176),
('Ruth', 'F', 28, 65, 131);
Now, we'll add a step to the pipeline to calculate the average age of these people.
$ xvc pipeline step new --step-name average-age --command "sqlite3 people.db 'SELECT AVG(Age) FROM People;'"
Let's run the step without a dependency first.
$ xvc pipeline run
[OUT] [average-age] 34.6666666666667
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
Now, we'll add a dependency to this step so that it only runs when the results of that query change.
$ xvc pipeline step dependency --step-name average-age --sqlite-query people.db 'SELECT count(*) FROM People;'
The dependency query is run every time the pipeline runs. It's expected to be lightweight to avoid performance issues.
So, when the number of people in the table changes, the step will run. Initially there are no recorded query results to compare against, so the step will run again.
$ xvc pipeline run
[OUT] [average-age] 34.6666666666667
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
But it won't run the step a second time, as the table didn't change.
$ xvc pipeline run
Let's add another row to the table:
$ sqlite3 people.db "INSERT INTO People (Name, Sex, Age, Height_in, Weight_lbs) VALUES ('Asude', 'F', 10, 74, 170);"
This time, the step will run again as the result of the dependency query (SELECT count(*) FROM People) changed.
$ xvc pipeline run
[OUT] [average-age] 33.3684210526316
[DONE] [average-age] (sqlite3 people.db 'SELECT AVG(Age) FROM People;')
Xvc opens the database in read-only mode to avoid locking.
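The check described above can be sketched with Python's standard sqlite3 module: open the database read-only, run the lightweight query, and compare a digest of the rows with the recorded one. This is an illustration under assumptions, not Xvc's implementation. query_digest is a hypothetical helper, SHA-256 is a stand-in digest, and the two-column table is a reduced version of the People table.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def query_digest(db_path: Path, query: str) -> str:
    """Run the query on a read-only connection and hash the rows."""
    uri = db_path.resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(query).fetchall()
    finally:
        connection.close()
    return hashlib.sha256(repr(rows).encode()).hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    db_path = Path(workdir) / "people.db"
    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE People (Name TEXT, Age INTEGER)")
    writer.execute("INSERT INTO People VALUES ('Alex', 41), ('Bert', 42)")
    writer.commit()

    count_query = "SELECT count(*) FROM People;"
    first = query_digest(db_path, count_query)
    second = query_digest(db_path, count_query)

    # A new row changes the count, so the digest changes too.
    writer.execute("INSERT INTO People VALUES ('Asude', 10)")
    writer.commit()
    third = query_digest(db_path, count_query)
    writer.close()

print(first == second, first == third)
```

Only the digest needs to be stored between runs, which matches the note that the query results themselves are not kept.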
(Hyper-)Parameter Dependencies
You may be keeping pipeline-wide parameters in structured text files. You can specify such parameters found in JSON, TOML and YAML files as dependencies.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Suppose we have a YAML file, myparams.yaml, in which we specify various parameters.
param: value
database:
  server: example.com
  port: 5432
  connection:
    timeout: 5000
numeric_param: 13
Now, we create two steps to read different variables from the file and a dependency between them to force them to run in the same order always.
$ xvc pipeline step new --step-name read-database-config --command 'echo "Updated Database Configuration"'
$ xvc pipeline step new --step-name read-hyperparams --command 'echo "Update Hyperparameters"'
$ xvc pipeline step dependency --step-name read-database-config --step read-hyperparams
Let's add dependencies on different pieces of this parameters file to these steps:
$ xvc pipeline step dependency --step-name read-database-config --param 'myparams.yaml::database.port' --param 'myparams.yaml::database.server' --param 'myparams.yaml::database.connection'
$ xvc pipeline step dependency --step-name read-hyperparams --param 'myparams.yaml::param' --param 'myparams.yaml::numeric_param'
Run for the first time, as initially all dependencies are invalid:
$ xvc pipeline run
[OUT] [read-hyperparams] Update Hyperparameters
[DONE] [read-hyperparams] (echo "Update Hyperparameters")
[OUT] [read-database-config] Updated Database Configuration
[DONE] [read-database-config] (echo "Updated Database Configuration")
On the second run, no step runs because nothing has changed:
$ xvc pipeline run
When you update a value in this file, it will only invalidate the steps that depend on that value, not the other steps that rely on the same file.
Let's update the database port:
$ perl -pi -e 's/5432/9876/g' myparams.yaml
$ xvc pipeline run
[OUT] [read-database-config] Updated Database Configuration
[DONE] [read-database-config] (echo "Updated Database Configuration")
Note that read-hyperparams is not invalidated, though the values are in the same file.
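The per-key behavior can be sketched as follows: look up the value under a dotted key and hash only that value, so an edit elsewhere in the same file leaves the digest unchanged. The Python sketch uses a plain dict that mirrors the structure implied by the keys used above instead of parsing YAML. get_param and param_digest are hypothetical helpers and SHA-256 is a stand-in digest.

```python
import copy
import hashlib
import json


def get_param(params: dict, dotted_key: str):
    """Follow a hierarchical key such as 'database.port' into nested dicts."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def param_digest(params: dict, dotted_key: str) -> str:
    """Digest of the value found under the dotted key."""
    serialized = json.dumps(get_param(params, dotted_key), sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()


params = {
    "param": "value",
    "database": {
        "server": "example.com",
        "port": 5432,
        "connection": {"timeout": 5000},
    },
    "numeric_param": 13,
}

updated = copy.deepcopy(params)
updated["database"]["port"] = 9876

for key in ("database.port", "database.connection", "numeric_param"):
    changed = param_digest(params, key) != param_digest(updated, key)
    print(key, "changed" if changed else "unchanged")
```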
Step Dependencies
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You can add a step dependency to a step. Step dependencies specify dependency relationships explicitly, without relying on changed files or directories.
$ xvc pipeline step new --step-name world --command "echo world"
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline step dependency --step-name world --step hello
When run, the dependency will be run first and the step will be run after.
$ xvc pipeline run
[OUT] [hello] hello
[DONE] [hello] (echo hello)
[OUT] [world] world
[DONE] [world] (echo world)
If the dependency is not run, the dependent step won't run either.
$ xvc pipeline step update --step-name hello --when never
$ xvc pipeline run
If you want the dependent step to run always, you can set it to run always explicitly.
$ xvc pipeline step update --step-name world --when always
$ xvc pipeline run
[OUT] [world] world
[DONE] [world] (echo world)
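The ordering imposed by step dependencies is a topological order of the dependency graph. The Python sketch below shows this with the standard library's graphlib module (Python 3.9 or later); run_order is a hypothetical helper and says nothing about how Xvc schedules steps internally.

```python
from graphlib import TopologicalSorter


def run_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order steps so that every step comes after the steps it depends on.

    The mapping goes from a step name to the set of steps it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# world depends on hello, as in the example above.
print(run_order({"world": {"hello"}}))
```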
URL Dependencies
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You can use a web URL as a dependency to a step. When the URL is fetched, the hash of the output is saved for comparison, and the step is invalidated when the output of the URL changes.
You can use this with any URL.
$ xvc pipeline step new --step-name xvc-docs-update --command "echo 'Xvc docs updated!'"
$ xvc pipeline step dependency --step-name xvc-docs-update --url https://docs.xvc.dev/
The step is invalidated when the page is updated.
$ xvc pipeline run
[OUT] [xvc-docs-update] Xvc docs updated!
[DONE] [xvc-docs-update] (echo 'Xvc docs updated!')
The step won't run again until a new version of the page is published.
$ xvc pipeline run
Note that Xvc doesn't download the page every time. It checks the Last-Modified and ETag headers and only downloads the page if it has changed.
If there are more complex requirements than just the URL changing, you can use a generic dependency to get the output of a command and use that as a dependency.
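The header check mentioned above can be sketched as a comparison between the validators recorded at the previous fetch and the ones the server reports now. The rule below is one plausible reading, not Xvc's actual logic: download if any validator present on both sides differs, or if there is nothing to compare. needs_download is a hypothetical helper and the header values are made up for the example.

```python
def needs_download(previous: dict[str, str], current: dict[str, str]) -> bool:
    """Decide whether to fetch the body again based on ETag / Last-Modified."""
    compared = False
    for header in ("ETag", "Last-Modified"):
        if header in previous and header in current:
            compared = True
            if previous[header] != current[header]:
                return True
    # Without any comparable validator, the safe choice is to download.
    return not compared


recorded = {"ETag": '"abc123"', "Last-Modified": "Wed, 15 Feb 2023 10:00:00 GMT"}
same = dict(recorded)
republished = {"ETag": '"def456"', "Last-Modified": "Thu, 16 Feb 2023 09:30:00 GMT"}

print(needs_download(recorded, same))
print(needs_download(recorded, republished))
print(needs_download(recorded, {}))
```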
Generic Command Dependencies
This command works only in Xvc repositories.
$ git init
...
$ xvc init
You can use the output of a shell command as a dependency to a step. When the command is run, the hash of its output is saved for comparison, and the step is invalidated when the output of the command changes.
You can use this for any command that outputs a string.
$ xvc pipeline step new --step-name morning-message --command "echo 'Good Morning!'"
$ xvc pipeline step dependency --step-name morning-message --generic 'date +%F'
The step is invalidated when the date changes and the step is run again.
$ xvc pipeline run
[OUT] [morning-message] Good Morning!
[DONE] morning-message (echo 'Good Morning!')
The step won't run until tomorrow, when date +%F changes.
$ xvc pipeline run
[OUT] [morning-message] Good Morning!
[DONE] [morning-message] (echo 'Good Morning!')
You can mimic all kinds of pipeline behavior with this generic dependency.
For example, if you want to run a command when directory contents change, you can depend on the output of ls:
$ xvc pipeline step new --step-name directory-contents --command "echo 'Files changed'"
$ xvc pipeline step dependency --step-name directory-contents --generic 'ls'
$ xvc pipeline run
[OUT] [directory-contents] Files changed
[DONE] [directory-contents] (echo 'Files changed')
When you add a file to the directory, the step is invalidated and run again:
$ xvc pipeline run
$ xvc-test-helper generate-random-file new-file.txt
$ xvc pipeline run
[OUT] [directory-contents] Files changed
[DONE] [directory-contents] (echo 'Files changed')
Caveats
Tips
Most shells support editing longer commands with an editor. For bash, you can use Ctrl+X Ctrl+E.
Pipeline commands can get long quickly. You can use xvc aliases for shorter versions. Type source $(xvc aliases) to load the aliases into your shell.
xvc pipeline step output
Purpose
Define an output (file, metric, or image) for an existing step in the pipeline.
Synopsis
$ xvc pipeline step output --help
Add an output to a step
Usage: xvc pipeline step output [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to add the output to
--output-file <FILES> Add a file output to the step. Can be used multiple times
--output-metric <METRICS> Add a metric output to the step. Can be used multiple times
--output-image <IMAGES> Add an image output to the step. Can be used multiple times
-h, --help Print help
Examples
Caveats
xvc pipeline step show
Purpose
Print the configuration of a step in a pipeline.
Synopsis
$ xvc pipeline step show --help
Print step configuration
Usage: xvc pipeline step show --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to show
-h, --help Print help
Examples
Caveats
xvc pipeline step update
Purpose
Update the name, running condition, or command of a step.
Synopsis
$ xvc pipeline step update --help
Update step options
Usage: xvc pipeline step update [OPTIONS] --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to update. The step should already be defined
-c, --command <COMMAND> Step command to run
--when <WHEN> When to run the command. One of always, never, by_dependencies (default). This is used to freeze or invalidate a step manually
-h, --help Print help
Examples
Caveats
xvc pipeline step remove
Purpose
Remove a step and all its dependencies and outputs from the pipeline.
Synopsis
$ xvc pipeline step remove --help
Remove a step from a pipeline
Usage: xvc pipeline step remove --step-name <STEP_NAME>
Options:
-s, --step-name <STEP_NAME> Name of the step to remove
-h, --help Print help
Examples
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Let's create a few steps and make them depend on each other.
$ xvc pipeline step new --step-name hello --command 'echo hello >> hello.txt'
$ xvc pipeline step new --step-name world --command 'echo world >> world.txt'
$ xvc pipeline step new --step-name from --command 'echo from >> from.txt'
$ xvc pipeline step new --step-name xvc --command 'echo xvc >> xvc.txt'
Let's specify the outputs as well.
$ xvc pipeline step output --step-name hello --output-file hello.txt
$ xvc pipeline step output --step-name world --output-file world.txt
$ xvc pipeline step output --step-name from --output-file from.txt
$ xvc pipeline step output --step-name xvc --output-file xvc.txt
Now we can add dependencies between them.
$ xvc pipeline step dependency --step-name xvc --step from
$ xvc pipeline step dependency --step-name from --file world.txt
$ xvc pipeline step dependency --step-name world --step hello
Now the pipeline looks like this:
$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
from: echo from >> from.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)
$ xvc pipeline dag --format mermaid
flowchart TD
n0["hello"]
n1["hello.txt"] --> n0
n2["world"]
n0["hello"] --> n2
n3["world.txt"] --> n2
n4["from"]
n3["world.txt"] --> n4
n5["from.txt"] --> n4
n6["xvc"]
n4["from"] --> n6
n7["xvc.txt"] --> n6
When we remove a step, all its dependencies and outputs are removed as well.
$ xvc -vv pipeline step remove --step-name from
[INFO] Removing dep: file(world.txt)
[INFO] Removing dep step(from) from xvc
[INFO] Removing output: File
[INFO] Removing step: from
$ xvc pipeline step list
hello: echo hello >> hello.txt (by_dependencies)
world: echo world >> world.txt (by_dependencies)
xvc: echo xvc >> xvc.txt (by_dependencies)
$ xvc pipeline dag --format mermaid
flowchart TD
n0["hello"]
n1["hello.txt"] --> n0
n2["world"]
n0["hello"] --> n2
n3["world.txt"] --> n2
n4["xvc"]
n5["xvc.txt"] --> n4
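The effect of removing a step can be modeled with a small data structure. The sketch below is only an illustration of the behavior shown in the log above; the `Step` class and `remove_step` function are hypothetical and do not reflect how Xvc stores pipelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (kind, target), e.g. ("step", "from") or ("file", "world.txt")
Dependency = Tuple[str, str]


@dataclass
class Step:
    command: str
    dependencies: List[Dependency] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def remove_step(pipeline: Dict[str, Step], name: str) -> None:
    """Remove a step together with every step dependency that points at it."""
    # Dropping the entry discards the step's own dependencies and outputs.
    del pipeline[name]
    # Other steps must not keep a dangling reference to the removed step.
    for step in pipeline.values():
        step.dependencies = [dep for dep in step.dependencies if dep != ("step", name)]


pipeline = {
    "hello": Step("echo hello >> hello.txt", outputs=["hello.txt"]),
    "world": Step("echo world >> world.txt", [("step", "hello")], ["world.txt"]),
    "from": Step("echo from >> from.txt", [("file", "world.txt")], ["from.txt"]),
    "xvc": Step("echo xvc >> xvc.txt", [("step", "from")], ["xvc.txt"]),
}
remove_step(pipeline, "from")
print(list(pipeline))  # ['hello', 'world', 'xvc']
```

After the call, the `xvc` step has no dependencies left, which matches the second diagram above.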
xvc pipeline run
Synopsis
$ xvc pipeline run --help
Run a pipeline
Usage: xvc pipeline run [OPTIONS]
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline to run
-h, --help Print help
Examples
Pipelines require Xvc to be initialized before running.
$ git init
...
$ xvc init
Xvc defines a default pipeline and any steps added without specifying the pipeline will be added to it.
$ xvc pipeline list
+---------+---------+
| Name | Run Dir |
+===================+
| default | |
+---------+---------+
Create a new step in this pipeline with the xvc pipeline step new command.
$ xvc pipeline step new --step-name hello --command "echo hello"
$ xvc pipeline dag --format=mermaid
flowchart TD
n0["hello"]
You can run the default pipeline without specifying its name.
$ xvc pipeline run
[OUT] [hello] hello
[DONE] [hello] (echo hello)
Note that a step with no dependencies always runs unless it is explicitly set to never run.
$ xvc pipeline step update --step-name hello --when never
$ xvc pipeline run
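The running conditions can be summarized as a small decision function. This is a sketch of the rules as described in this document, not Xvc's implementation; `should_run` and its parameters are hypothetical names.

```python
def should_run(when: str, has_dependencies: bool, dependencies_changed: bool) -> bool:
    """Decide whether a step runs, given its --when value and dependency state."""
    if when == "never":
        return False
    if when == "always":
        return True
    if when == "by_dependencies":
        # A step without dependencies has nothing to wait for, so it runs.
        if not has_dependencies:
            return True
        return dependencies_changed
    raise ValueError(f"unknown running condition: {when!r}")
```

For the `hello` step above, `should_run("by_dependencies", False, False)` is `True`, and after the update to `--when never` the step is skipped.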
Run a specific pipeline
You can run a specific pipeline by specifying its name with the --pipeline-name option.
$ xvc pipeline new --pipeline-name my-pipeline
$ xvc pipeline --pipeline-name my-pipeline step new --step-name my-hello --command "echo 'hello from my-pipeline'"
$ xvc pipeline run --pipeline-name my-pipeline
[OUT] [my-hello] hello from my-pipeline
[DONE] [my-hello] (echo 'hello from my-pipeline')
xvc pipeline delete
Synopsis
$ xvc pipeline delete --help
Delete a pipeline
Usage: xvc pipeline delete --pipeline-name <PIPELINE_NAME>
Options:
-p, --pipeline-name <PIPELINE_NAME> Name or GUID of the pipeline to be deleted
-h, --help Print help
xvc pipeline export
Synopsis
$ xvc pipeline export --help
Export the pipeline to a YAML or JSON file to edit
Usage: xvc pipeline export [OPTIONS]
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline to export
--file <FILE> File to write the pipeline. Writes to stdout if not set
--format <FORMAT> Output format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
-h, --help Print help
Examples
You can export the pipeline you created to a JSON or YAML file, edit it, and restore it using xvc pipeline import. This lets you fix typos, update commands in place, and see pipeline internals for debugging.
Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
Let's start by defining a few steps in the pipeline.
$ xvc pipeline step new --step-name step1 --command 'touch abc.txt'
$ xvc pipeline step new --step-name step2 --command 'touch def.txt'
Now add a few dependencies and outputs.
$ xvc pipeline step dependency -s step2 --step step1
$ xvc pipeline step dependency -s step2 --glob '*.txt'
$ xvc pipeline step dependency -s step2 --glob-items '*.txt'
$ xvc pipeline step dependency -s step2 --param model.conv_units
$ xvc pipeline step dependency -s step2 --regex requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --regex-items requirements.txt:/^tensorflow
$ xvc pipeline step dependency -s step2 --line-items params.yaml::1-20
$ xvc pipeline step dependency -s step2 --lines params.yaml::1-20
$ xvc pipeline step dependency -s step2 --url 'https://example.com'
$ xvc pipeline step dependency -s step2 --generic 'ping -c 2 example.com'
$ xvc pipeline step output -s step2 --output-metric metrics.json
$ xvc pipeline step output -s step2 --output-file def.txt
$ xvc pipeline step output -s step2 --output-image plots/confusion.png
If you don't specify a filename, the default format is JSON and the output will be sent to stdout.
$ xvc pipeline export
{
"name": "default",
"steps": [
{
"command": "touch abc.txt",
"dependencies": [],
"invalidate": "ByDependencies",
"name": "step1",
"outputs": []
},
{
"command": "touch def.txt",
"dependencies": [
{
"Step": {
"name": "step1"
}
},
{
"Generic": {
"generic_command": "ping -c 2 example.com",
"output_digest": null
}
},
{
"GlobItems": {
"glob": "*.txt",
"xvc_path_content_digest_map": {},
"xvc_path_metadata_map": {}
}
},
{
"Glob": {
"content_digest": null,
"glob": "*.txt",
"xvc_metadata_digest": null,
"xvc_paths_digest": null
}
},
{
"RegexItems": {
"lines": [],
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
}
},
{
"Regex": {
"lines_digest": null,
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
}
},
{
"Param": {
"format": "YAML",
"key": "model.conv_units",
"path": "params.yaml",
"value": null,
"xvc_metadata": null
}
},
{
"LineItems": {
"begin": 1,
"end": 20,
"lines": [],
"path": "params.yaml",
"xvc_metadata": null
}
},
{
"Lines": {
"begin": 1,
"digest": null,
"end": 20,
"path": "params.yaml",
"xvc_metadata": null
}
},
{
"UrlDigest": {
"etag": null,
"last_modified": null,
"url": "https://example.com/",
"url_content_digest": null
}
}
],
"invalidate": "ByDependencies",
"name": "step2",
"outputs": [
{
"File": {
"path": "def.txt"
}
},
{
"Metric": {
"format": "JSON",
"path": "metrics.json"
}
},
{
"Image": {
"path": "plots/confusion.png"
}
}
]
}
],
"version": 1,
"workdir": ""
}
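An exported JSON file can be inspected with a short script. The sketch below assumes only the structure visible in the output above: each step has `name` and `dependencies` keys, and each dependency is an object with a single key naming its kind. The embedded document is a trimmed copy of that output, and `dependency_kinds` is a hypothetical helper.

```python
import json
from typing import Dict, List

exported = """
{
  "name": "default",
  "steps": [
    {"command": "touch abc.txt", "dependencies": [],
     "invalidate": "ByDependencies", "name": "step1", "outputs": []},
    {"command": "touch def.txt",
     "dependencies": [
       {"Step": {"name": "step1"}},
       {"Glob": {"content_digest": null, "glob": "*.txt",
                 "xvc_metadata_digest": null, "xvc_paths_digest": null}}
     ],
     "invalidate": "ByDependencies", "name": "step2",
     "outputs": [{"File": {"path": "def.txt"}}]}
  ],
  "version": 1,
  "workdir": ""
}
"""


def dependency_kinds(pipeline_json: str) -> Dict[str, List[str]]:
    """Map each step name to the list of its dependency kinds."""
    pipeline = json.loads(pipeline_json)
    kinds = {}
    for step in pipeline["steps"]:
        # Each dependency is an object whose only key names its kind.
        kinds[step["name"]] = [next(iter(dep)) for dep in step["dependencies"]]
    return kinds


print(dependency_kinds(exported))  # {'step1': [], 'step2': ['Step', 'Glob']}
```

Since the file format is not guaranteed to stay compatible across versions, a script like this is best treated as a debugging aid.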
If you want to set the format, you can specify the --format option.
$ xvc pipeline export --format yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
dependencies:
- !Step
name: step1
- !Generic
generic_command: ping -c 2 example.com
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
url: https://example.com/
etag: null
last_modified: null
url_content_digest: null
outputs:
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
When you specify a file name, the output format is inferred from the extension.
$ xvc pipeline export --file pipeline.yaml
$ cat pipeline.yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
dependencies:
- !Step
name: step1
- !Generic
generic_command: ping -c 2 example.com
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
url: https://example.com/
etag: null
last_modified: null
url_content_digest: null
outputs:
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
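The format selection rule described in the options (explicit --format first, then the file extension, then JSON) can be sketched as below. This is an illustration only: `guess_format` is a hypothetical name, and treating `.yml` as YAML and unknown extensions as JSON are assumptions not confirmed by this document.

```python
from pathlib import Path
from typing import Optional


def guess_format(file: Optional[str] = None, fmt: Optional[str] = None) -> str:
    """Pick "json" or "yaml" from an explicit format or a file extension."""
    if fmt is not None:
        if fmt not in ("json", "yaml"):
            raise ValueError(f"unsupported format: {fmt!r}")
        # An explicit format always wins over the extension.
        return fmt
    if file is not None:
        suffix = Path(file).suffix.lower()
        # Assumption: both .yaml and .yml select YAML.
        if suffix in (".yaml", ".yml"):
            return "yaml"
    # No format and no YAML extension: fall back to JSON.
    return "json"
```

Under these assumptions, `guess_format("pipeline.yaml")` returns `"yaml"` and `guess_format()` returns `"json"`, matching the two export examples above.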
xvc pipeline import
Synopsis
$ xvc pipeline import --help
Import the pipeline from a file
Usage: xvc pipeline import [OPTIONS]
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline to import. If not set, the name from the file is used
--file <FILE> File to read the pipeline. Use stdin if not specified
--format <FORMAT> Input format. One of json or yaml. If not set, the format is guessed from the file extension. If the file extension is not set, json is used as default
--overwrite Overwrite the pipeline even if the name already exists
-h, --help Print help
Examples
This command imports pipelines exported with xvc pipeline export. You can edit the exported files before importing them.
Xvc doesn't guarantee that the format of these files will be compatible across versions. You can use these files to share pipeline definitions but it may not be a good way to store pipeline definitions for longer periods.
This command works only in Xvc repositories.
$ git init
...
$ xvc init
The following file was generated with xvc pipeline export.
$ cat pipeline.yaml
version: 1
name: default
workdir: ''
steps:
- name: step1
command: touch abc.txt
invalidate: ByDependencies
dependencies: []
outputs: []
- name: step2
command: touch def.txt
invalidate: ByDependencies
dependencies:
- !Step
name: step1
- !Generic
generic_command: ping -c 2 example.com
output_digest: null
- !GlobItems
glob: '*.txt'
xvc_path_metadata_map: {}
xvc_path_content_digest_map: {}
- !Glob
glob: '*.txt'
xvc_paths_digest: null
xvc_metadata_digest: null
content_digest: null
- !RegexItems
path: requirements.txt
regex: ^tensorflow
lines: []
xvc_metadata: null
- !Regex
path: requirements.txt
regex: ^tensorflow
lines_digest: null
xvc_metadata: null
- !Param
format: YAML
path: params.yaml
key: model.conv_units
value: null
xvc_metadata: null
- !LineItems
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
lines: []
- !Lines
path: params.yaml
begin: 1
end: 20
xvc_metadata: null
digest: null
- !UrlDigest
url: https://example.com/
etag: null
last_modified: null
url_content_digest: null
outputs:
- !File
path: def.txt
- !Metric
path: metrics.json
format: JSON
- !Image
path: plots/confusion.png
You can import this file to construct the pipeline at once.
Note that the export command outputs JSON by default.
$ xvc pipeline import --file pipeline.yaml --overwrite
$ xvc pipeline export
{
"name": "default",
"steps": [
{
"command": "touch abc.txt",
"dependencies": [],
"invalidate": "ByDependencies",
"name": "step1",
"outputs": []
},
{
"command": "touch def.txt",
"dependencies": [
{
"Step": {
"name": "step1"
}
},
{
"Generic": {
"generic_command": "ping -c 2 example.com",
"output_digest": null
}
},
{
"GlobItems": {
"glob": "*.txt",
"xvc_path_content_digest_map": {},
"xvc_path_metadata_map": {}
}
},
{
"Glob": {
"content_digest": null,
"glob": "*.txt",
"xvc_metadata_digest": null,
"xvc_paths_digest": null
}
},
{
"RegexItems": {
"lines": [],
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
}
},
{
"Regex": {
"lines_digest": null,
"path": "requirements.txt",
"regex": "^tensorflow",
"xvc_metadata": null
}
},
{
"Param": {
"format": "YAML",
"key": "model.conv_units",
"path": "params.yaml",
"value": null,
"xvc_metadata": null
}
},
{
"LineItems": {
"begin": 1,
"end": 20,
"lines": [],
"path": "params.yaml",
"xvc_metadata": null
}
},
{
"Lines": {
"begin": 1,
"digest": null,
"end": 20,
"path": "params.yaml",
"xvc_metadata": null
}
},
{
"UrlDigest": {
"etag": null,
"last_modified": null,
"url": "https://example.com/",
"url_content_digest": null
}
}
],
"invalidate": "ByDependencies",
"name": "step2",
"outputs": [
{
"File": {
"path": "def.txt"
}
},
{
"Metric": {
"format": "JSON",
"path": "metrics.json"
}
},
{
"Image": {
"path": "plots/confusion.png"
}
}
]
}
],
"version": 1,
"workdir": ""
}
If you don't supply the --overwrite option, Xvc will report an error and quit.
$ xvc pipeline import --file pipeline.yaml
? 1
[ERROR] Pipeline Error: Pipeline default already found
Error: PipelineError { source: PipelineAlreadyFound { name: "default" } }
You can specify a new name for the pipeline and it will override the name set in the file. This way you can edit and import similar pipelines with minor differences.
$ xvc pipeline import --pipeline-name another-pipeline --file pipeline.yaml
You can also use stdin to import a pipeline but you must specify the input format.
xvc pipeline update
Synopsis
$ xvc pipeline update --help
Update the name and other attributes of a pipeline
Usage: xvc pipeline update [OPTIONS]
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline this command applies to
--rename <RENAME> Rename the pipeline to
--workdir <WORKDIR> Set the working directory
--set-default set this pipeline default
-h, --help Print help
xvc pipeline dag
Synopsis
$ xvc pipeline dag --help
Generate a dot or mermaid diagram for the pipeline
Usage: xvc pipeline dag [OPTIONS]
Options:
-p, --pipeline-name <PIPELINE_NAME> Name of the pipeline to generate the diagram
--file <FILE> Output file. Writes to stdout if not set
--format <FORMAT> Format for graph. Either dot or mermaid [default: dot]
-h, --help Print help
You can visualize a pipeline defined with the xvc pipeline set of commands by using the xvc pipeline dag command. It generates a dot or mermaid diagram for the pipeline.
Examples
Like all other pipeline commands, this requires an Xvc repository.
$ git init --initial-branch=main
Initialized empty Git repository in [CWD]/.git/
$ xvc init
All steps of the pipeline are shown as nodes in the graph.
We create a dependency between the two steps with xvc pipeline step dependency and its --step option to make them run sequentially.
$ xvc pipeline step new --step-name preprocess --command "echo 'preprocess'"
$ xvc pipeline step new --step-name train --command "echo 'train'"
$ xvc pipeline step dependency --step-name train --step preprocess
The raw output isn't very readable, but you can pipe it directly to dot to get a more useful rendering.
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n1;}
Passing this output through dot -Tsvg renders the graph as an SVG image.
When you add a dependency to a step, the graph shows the dependency as a node. For example,
$ xvc pipeline step dependency --step-name preprocess --glob 'data/*'
$ xvc pipeline dag
digraph pipeline{n0[shape=box;label="preprocess";];n1[shape=folder;label="data/*";];n1->n0;n2[shape=box;label="train";];n0[shape=box;label="preprocess";];n0->n2;}
You can use the --format=mermaid option to get a mermaid.js diagram.
$ xvc pipeline dag --format=mermaid
flowchart TD
n0["preprocess"]
n1["data/*"] --> n0
n2["train"]
n0["preprocess"] --> n2
The output can be used in Mermaid Live Editor or any web page that supports the format.
flowchart TD
n0["train"]
n1["preprocess"] --> n0
n1["preprocess"]
n2["data/*"] --> n1
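To make the structure of the Mermaid output concrete, here is a small sketch that builds the same kind of flowchart text from a mapping of steps to dependencies. It is not Xvc's generator; `to_mermaid` is a hypothetical function, and labels containing double quotes are not escaped.

```python
from typing import Dict, List


def to_mermaid(steps: Dict[str, List[str]]) -> str:
    """Render steps and their dependencies as a Mermaid flowchart.

    `steps` maps a step name to the labels of its dependencies,
    in the order they were defined.
    """
    ids: Dict[str, str] = {}

    def node(label: str) -> str:
        # Reuse the identifier when a label was already seen.
        if label not in ids:
            ids[label] = f"n{len(ids)}"
        return f'{ids[label]}["{label}"]'

    lines = ["flowchart TD"]
    for step, dependencies in steps.items():
        lines.append(node(step))
        for dependency in dependencies:
            # Edges point from the dependency to the step that needs it.
            lines.append(f"{node(dependency)} --> {ids[step]}")
    return "\n".join(lines)


print(to_mermaid({"preprocess": ["data/*"], "train": ["preprocess"]}))
```

For this input the function prints the same five lines as the `xvc pipeline dag --format=mermaid` output shown earlier in this section.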
Storage management commands (xvc storage)
Purpose
Xvc can keep tracked content in storages.
These can be on the local file system or in the cloud.
The xvc storage set of commands lets you configure, list, and delete these storages.
Synopsis
$ xvc storage --help
Storage (cloud) management commands
Usage: xvc storage <COMMAND>
Commands:
list List all configured storages
remove Remove a storage configuration
new Configure a new storage
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc storage list
Purpose
List all configured storages with their names and guids.
Synopsis
$ xvc storage list --help
List all configured storages
Usage: xvc storage list
Options:
-h, --help Print help
Examples
List all storage configurations in the repository:
$ xvc storage list
Caveats
This command uses the local configuration and doesn't try to connect to the storages. A storage being listed doesn't guarantee that you can pull from or push to it.
xvc storage remove
Purpose
Remove unused or inaccessible storages from the configuration
Synopsis
$ xvc storage remove --help
Remove a storage configuration.
This doesn't delete any files in the storage.
Usage: xvc storage remove --name <NAME>
Options:
--name <NAME>
Name of the storage to be deleted
-h, --help
Print help (see a summary with '-h')
Caveats
xvc storage new
Synopsis
$ xvc storage new --help
Configure a new storage
Usage: xvc storage new <COMMAND>
Commands:
local Add a new local storage
generic Add a new generic storage
rsync Add a new rsync storages
s3 Add a new S3 storage
minio Add a new Minio storage
digital-ocean Add a new Digital Ocean storage
r2 Add a new R2 storage
gcs Add a new Google Cloud Storage storage
wasabi Add a new Wasabi storage
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
xvc storage new local
Purpose
Create a new storage reachable from the local filesystem. It lets you keep tracked file contents in a different directory for backup or sharing purposes.
Synopsis
$ xvc storage new local --help
Add a new local storage
A local storage is a directory accessible from the local file system. Xvc will use common file operations for this directory without accessing the network.
Usage: xvc storage new local --path <PATH> --name <NAME>
Options:
--path <PATH>
Directory (outside the repository) to be set as a storage
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-h, --help
Print help (see a summary with '-h')
Examples
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
Now, you can define a local directory as storage and begin to use it.
$ xvc storage new local --name backup --path '../my-local-storage'
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa
[DELETE] [CWD]/.xvc/b3/3c6/70f
[DELETE] [CWD]/.xvc/b3/3c6
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d
[DELETE] [CWD]/.xvc/b3/7aa/354
[DELETE] [CWD]/.xvc/b3/7aa
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb
[DELETE] [CWD]/.xvc/b3/d7d/629
[DELETE] [CWD]/.xvc/b3/d7d
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
Caveats
--name NAME is not checked for uniqueness, but you should use unique storage names so that you can refer to them later.
--path PATH should be accessible for writing and shouldn't already exist.
Technical Details
The command creates the PATH and a new file under PATH called .xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded to the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to PATH/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication. (yet)
In the future, there may be an option to have a common storage for multiple projects at the same location. Please comment below if this is a common use case.
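The path mapping described in this section can be sketched as a pure function. This is an illustration, not Xvc's code: `storage_path` is a hypothetical name and the repository identifier used in the example is a made-up value.

```python
from pathlib import PurePosixPath


def storage_path(storage_root: str, repo_id: str, cache_path: str) -> PurePosixPath:
    """Map a cache path inside .xvc/ to its location in a local storage."""
    relative = PurePosixPath(cache_path)
    # Drop the leading .xvc component so only the hash prefix and cache path remain.
    if relative.parts and relative.parts[0] == ".xvc":
        relative = PurePosixPath(*relative.parts[1:])
    # Files of each repository live under a directory named after its identifier.
    return PurePosixPath(storage_root) / repo_id / relative


cache_file = ".xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin"
# "example-repo-id" is a placeholder, not a real Xvc repository identifier.
print(storage_path("../my-local-storage", "example-repo-id", cache_file))
```

Two projects sharing the same storage root get different `repo_id` directories, which is why identical files are not deduplicated across projects.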
xvc storage new generic
Purpose
Create a new storage that uses shell commands to send and retrieve cache files. It lets you keep tracked files in any kind of service that can be used from the command line.
Synopsis
$ xvc storage new generic --help
Add a new generic storage.
⚠️ Please note that this is an advanced method to configure storages. You may damage your repository and local and storage files with incorrect configurations.
Please see https://docs.xvc.dev/ref/xvc-storage-new-generic.html for examples and make necessary backups.
Usage: xvc storage new generic [OPTIONS] --name <NAME> --init <INIT_COMMAND> --list <LIST_COMMAND> --download <DOWNLOAD_COMMAND> --upload <UPLOAD_COMMAND> --delete <DELETE_COMMAND>
Options:
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
-i, --init <INIT_COMMAND>
Command to initialize the storage. This command is run once after defining the storage.
You can use {URL} and {STORAGE_DIR} as shortcuts.
-l, --list <LIST_COMMAND>
Command to list the files in storage
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-d, --download <DOWNLOAD_COMMAND>
Command to download a file from storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-u, --upload <UPLOAD_COMMAND>
Command to upload a file to storage.
You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options.
-D, --delete <DELETE_COMMAND>
The delete command to remove a file from storage You can use {URL} and {STORAGE_DIR} placeholders and define values for these with --url and --storage_dir options
-M, --processes <MAX_PROCESSES>
Number of maximum processes to run simultaneously
[default: 1]
--url <URL>
You can set a string to replace {URL} placeholder in commands
--storage-dir <STORAGE_DIR>
You can set a string to replace {STORAGE_DIR} placeholder in commands
-h, --help
Print help (see a summary with '-h')
You can use the following placeholders in your commands. These are replaced with the actual paths in runtime and commands are run with concrete paths.
{URL}: The content of the --url option. (default "")
{STORAGE_DIR}: Content of the --storage-dir option. (default "")
{RELATIVE_CACHE_PATH}: The portion of the cache path after .xvc/.
{ABSOLUTE_CACHE_PATH}: The absolute local path for the cache element.
{RELATIVE_CACHE_DIR}: The portion of directory that contains the file after .xvc/.
{ABSOLUTE_CACHE_DIR}: The portion of the local directory that contains the file after .xvc.
{XVC_GUID}: Repository GUID used in storages to distinguish repository elements.
{FULL_STORAGE_PATH}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_PATH}
{FULL_STORAGE_DIR}: Concatenation of {URL}{STORAGE_DIR}{XVC_GUID}/{RELATIVE_CACHE_DIR}
{LOCAL_GUID_FILE_PATH}: The path that contains the guid of the storage locally. Used only in the --init option.
{STORAGE_GUID_FILE_PATH}: The path that should have the guid of the storage, in storage. Used only in the --init option.
Examples
Create a generic storage in the same filesystem
You can create a storage that's using shell commands to send and receive files to another location in the file system.
Two values that you set yourself can be used in the commands: --url and --storage-dir.
For a storage in the same file system, --url could be blank and --storage-dir could be the location you want to define.
$ xvc storage new generic
--url ""
--storage-dir $HOME/my-xvc-storage
...
You need to specify the commands for the following operations:
init: The command that's used to create the directory that will be used as a storage. It should also copy XVC_STORAGE_GUID_FILENAME (currently .xvc-guid) to that location. This file is used to identify the location as an Xvc storage.
$ xvc storage new generic
...
--init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}'
...
Note that if the command doesn't contain the {LOCAL_GUID_FILE_PATH} and {STORAGE_GUID_FILE_PATH} variables, it won't be run and Xvc will report an error.
list: This operation should list all files under {URL}{STORAGE_DIR}. The list is filtered through a regex that matches the format of the paths. Hence, even if the command lists all files in the storage, Xvc will consider only the relevant paths.
All paths should be listed on separate lines.
$ xvc storage new generic
...
--list 'ls -1 {URL}{STORAGE_DIR}'
...
upload: The command that will copy a file from the local cache to the storage. Normally, it uses the {ABSOLUTE_CACHE_PATH} variable. For the local file system, we also need to create a directory before copying.
$ xvc storage new generic
...
--upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}'
...
download: This command will be used to copy from storage to the local cache. It must create the local cache directory as well.
$ xvc storage new generic
...
--download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}'
...
delete: This operation is used to delete the storage file. It shouldn't touch the local file in any way, otherwise you may lose data.
$ xvc storage new generic
...
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
...
In total, the command you write is the following. It defines all operations of this storage.
$ xvc storage new generic \
--url "" \
--storage-dir $HOME/my-xvc-storage \
--init 'mkdir -p {STORAGE_DIR} ; cp {LOCAL_GUID_FILE_PATH} {STORAGE_GUID_FILE_PATH}' \
--list 'ls -1 {URL}{STORAGE_DIR}' \
--upload 'mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}' \
--download 'mkdir -p {ABSOLUTE_CACHE_DIR} && cp {FULL_STORAGE_PATH} {ABSOLUTE_CACHE_PATH}' \
--delete 'rm -f {FULL_STORAGE_PATH} ; rmdir {FULL_STORAGE_DIR}'
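How a command template turns into a concrete command can be illustrated with plain string substitution. The sketch below is not Xvc's implementation: `expand_placeholders` is a hypothetical helper and the paths in `values` are invented for the example.

```python
from typing import Mapping


def expand_placeholders(template: str, values: Mapping[str, str]) -> str:
    """Replace every {NAME} placeholder in the template with its value."""
    command = template
    for name, value in values.items():
        command = command.replace("{" + name + "}", value)
    return command


upload = "mkdir -p {FULL_STORAGE_DIR} && cp {ABSOLUTE_CACHE_PATH} {FULL_STORAGE_PATH}"
# Illustrative values only; real paths depend on the repository and storage.
values = {
    "FULL_STORAGE_DIR": "/backup/repo-guid/b3/abc",
    "FULL_STORAGE_PATH": "/backup/repo-guid/b3/abc/0.bin",
    "ABSOLUTE_CACHE_PATH": "/project/.xvc/b3/abc/0.bin",
}
print(expand_placeholders(upload, values))
```

Placeholders without a value in the mapping are left untouched, which makes a missing definition easy to spot in the resulting command.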
Create a storage using rsync
Rsync is available on all popular platforms for copying file contents. Xvc can use it to maintain a storage if you already have a working rsync setup.
We need to define operations for init, upload, download, list and delete with rsync or ssh.
Some of the commands need ssh to perform operations, like creating a directory.
We'll use placeholders for paths.
As the rsync URL format is slightly different from the SSH one, we will define the commands verbosely.
Suppose you want to use your account at user@example.com to store your Xvc files.
You want to store the files under /home/user/my-xvc-storage.
We assume you have configured public key authentication for your account. Xvc doesn't receive user input during storage operations, and can't receive your password during runs.
We first define these as our --url and --storage-dir options.
$ xvc --url user@example.com
--storage-dir '/home/user/my-xvc-storage'
...
Initialization command must create this directory and copy the storage GUID file to its respective location.
$ xvc
...
--init "ssh {URL} 'mkdir -p {STORAGE_DIR}' ; rsync -av '{LOCAL_GUID_FILE_PATH}' '{URL}:{STORAGE_GUID_FILE_PATH}'"
Note the use of : in the rsync command.
As rsync doesn't support ssh:// URLs currently, we are using a form that's compatible with both ssh and rsync as the URL.
It may be possible to use && between the ssh and rsync commands, but if the first command fails (e.g. the directory already exists), we still want to copy the guid file.
Caveats
Technical Details
The paths in list commands are filtered through a regex.
They are matched against the {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 pattern and only the {RELATIVE_CACHE_DIR} portion is reported.
Any line that doesn't conform to this pattern is ignored.
You can use any listing command that returns a recursive file list, and only the pattern matching elements are considered.
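The filtering step can be sketched with a regular expression. The exact pattern Xvc uses is not given in this document, so the one below is an assumption modeled on the cache paths shown in the examples (a two-character prefix, two three-character directories, and a 58-character remainder, followed by a file named 0 with an optional extension). `filter_listing` is a hypothetical name.

```python
import re
from typing import Iterable, List


def filter_listing(lines: Iterable[str], repo_guid: str) -> List[str]:
    """Keep only lines that look like cache files of the given repository."""
    # Assumed layout: {REPO_GUID}/{RELATIVE_CACHE_DIR}/0 with an optional extension.
    pattern = re.compile(
        r"^" + re.escape(repo_guid)
        + r"/(?P<cache_dir>[a-z0-9]{2}/[0-9a-f]{3}/[0-9a-f]{3}/[0-9a-f]{58})"
        + r"/0(\.[^/]*)?$"
    )
    cache_dirs = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            # Only the relative cache directory portion is reported.
            cache_dirs.append(match.group("cache_dir"))
    return cache_dirs
```

Lines such as a guid file, files of another repository, or unrelated files do not match and are dropped, so a broad listing command like a recursive `ls` is harmless.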
xvc storage new rsync
Purpose
Configure a host reachable with rsync over SSH as an Xvc storage.
Synopsis
$ xvc storage new rsync --help
Add a new rsync storages
Uses rsync in separate processes to communicate. This can be used when you already have an SSH/Rsync connection. It doesn't prompt for any passwords. The connection must be set up with ssh keys beforehand.
Usage: xvc storage new rsync [OPTIONS] --name <NAME> --host <HOST> --storage-dir <STORAGE_DIR>
Options:
-n, --name <NAME>
Name of the storage.
Recommended to keep this name unique to refer easily.
--host <HOST>
Hostname for the connection in the form host.example.com (without @, : or protocol)
--port <PORT>
Port number for the connection in the form 22. Doesn't add port number to connection string if not given
--user <USER>
User name for the connection, the part before @ in user@example.com (without @, hostname). User name isn't included in connection strings if not given
--storage-dir <STORAGE_DIR>
storage directory in the host to store the files
-h, --help
Print help (see a summary with '-h')
Examples
You must set up an SSH connection with key-based authentication before using this command.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a directory on the remote host as storage and begin to use it.
$ xvc storage new rsync --name backup --host e1.xvc.dev --user iex --storage-dir /tmp/xvc-backup/
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa/0.bin
[DELETE] [CWD]/.xvc/b3/3c6/70f/e91055c2be2e87890dba1e952d656d1e70dd196bf5530d379243c6e4aa
[DELETE] [CWD]/.xvc/b3/3c6/70f
[DELETE] [CWD]/.xvc/b3/3c6
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d/0.bin
[DELETE] [CWD]/.xvc/b3/7aa/354/0225bd33702c239454b63b31d1ea25721cbbfb491d6139d0b85b82d15d
[DELETE] [CWD]/.xvc/b3/7aa/354
[DELETE] [CWD]/.xvc/b3/7aa
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb/0.bin
[DELETE] [CWD]/.xvc/b3/d7d/629/677c6d8df55ab3a1d694453c59f3ca0df494d3dc190aeef1e00abd96eb
[DELETE] [CWD]/.xvc/b3/d7d/629
[DELETE] [CWD]/.xvc/b3/d7d
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
xvc storage new s3
Purpose
Configure an S3 (or a compatible) service as an Xvc storage.
Synopsis
$ xvc storage new s3 --help
Add a new S3 storage
Reads credentials from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new s3 [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
--bucket-name <BUCKET_NAME>
S3 bucket name
--region <REGION>
AWS region
-h, --help
Print help (see a summary with '-h')
Examples
Before calling any commands that use this storage, you must set the following environment variables.
- AWS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
- AWS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Amazon Web Services account. The second form is used when you have multiple accounts and you want to use a specific one.
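As a rough sketch, the variables could be set in the shell before running Xvc. The values are placeholders, and the per-storage form assumes a storage named backup; how Xvc maps the storage name into the variable name should be checked against your version.

```shell
# Placeholder credentials; replace them with your own and keep them out of Git.
export AWS_ACCESS_KEY_ID="example-access-key-id"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"

# Per-storage form, assuming the storage is named "backup".
export XVC_STORAGE_ACCESS_KEY_ID_backup="example-access-key-id"
export XVC_STORAGE_SECRET_ACCESS_KEY_backup="example-secret-access-key"
```

The same pattern applies to the other S3-compatible storages described in this chapter, each with its own variable names.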
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new s3 --name backup --bucket-name xvc-test --region eu-central-1 --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
xvc storage new gcs
Purpose
Configure a Google Cloud Storage service as an Xvc storage.
Synopsis
$ xvc storage new gcs --help
Add a new Google Cloud Storage storage
Reads credentials from `GCS_ACCESS_KEY_ID` and `GCS_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new gcs [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server, e.g., europe-west3
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Please configure the S3-compatible interface of your Google Cloud Storage account before using this command.
Before calling any commands that use this storage, you must set the following environment variables.
- GCS_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.
- GCS_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Google Cloud Storage account. The second form is used when you have multiple storages with different access keys.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new gcs --name backup --bucket-name xvc-test --region europe-west3 --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
xvc storage new minio
Purpose
Create a new Xvc storage on a MinIO instance. It allows storing tracked file contents on a MinIO server.
Synopsis
$ xvc storage new minio --help
Add a new Minio storage
Reads credentials from `MINIO_ACCESS_KEY` and `MINIO_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new minio [OPTIONS] --name <NAME> --endpoint <ENDPOINT> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--endpoint <ENDPOINT>
Minio server url in the form https://myserver.example.com:9090
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Before calling any commands that use this storage, you must set the following environment variables.
- MINIO_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the MinIO server. The second form is used when you have multiple MinIO storages and you want to use a specific one.
- MINIO_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the MinIO server. The second form is used when you have multiple MinIO storages and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new minio --name backup --endpoint http://e1.xvc.dev:9000 --bucket-name xvc-tests --region us-east-1 --storage-prefix xvc
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
Caveats
--name NAME is not verified to be unique but you should use unique storage names to refer to them later.
You can also use storage GUIDs listed by xvc storage list to refer to storages.
You must have a valid connection to the server.
Xvc uses the Minio API port (9001, by default) to connect to the server. Ensure that it's accessible.
Because of the underlying library, Xvc tries to connect to http://xvc-bucket.example.com:9001 if you give http://example.com:9001 as the endpoint, and xvc-bucket as the bucket name.
You may need to consider this when you have servers running at exact URLs.
If you have http://minio.example.com:9001 as a Minio server, you may want to supply http://example.com:9001 as the endpoint, and minio as the bucket name to form the correct URL.
This behavior may change in the future.
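To see which URL results from a given endpoint and bucket name, the rule above can be mimicked in a few lines of shell. The effective_url function is a hypothetical helper written for this illustration; it only reproduces the behavior described here and isn't taken from Xvc or the underlying library.

```shell
# Hypothetical helper: prepend the bucket name to the host part of the endpoint.
effective_url() {
  endpoint=$1
  bucket=$2
  scheme=${endpoint%%://*}   # e.g. "http"
  rest=${endpoint#*://}      # e.g. "example.com:9001"
  printf '%s://%s.%s\n' "$scheme" "$bucket" "$rest"
}

effective_url "http://example.com:9001" "xvc-bucket"
effective_url "http://example.com:9001" "minio"
```

The second call shows the workaround: with minio as the bucket name, the resulting host is minio.example.com.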
Technical Details
This command requires Xvc to be compiled with the minio feature, which is on by default.
It uses Rust async features via the rust-s3 crate, and may add some bulk to the binary.
If you want to compile Xvc without these features, please refer to the How to Compile Xvc document.
The command creates the .xvc-guid file in http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/.xvc-guid.
The file contains the unique identifier for this storage.
The same identifier is also recorded to the project.
A file that's found in .xvc/{{HASH_PREFIX}}/{{CACHE_PATH}} is saved to http://{{BUCKET-NAME}}.{{ENDPOINT}}/{{STORAGE-PREFIX}}/{{REPO_ID}}/{{HASH_PREFIX}}/{{CACHE_PATH}}.
{{REPO_ID}} is the unique identifier for the repository created during xvc init.
Hence if you use a common storage for different Xvc projects, their files are kept under different directories.
There is no inter-project deduplication.
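The layout can be illustrated by composing the remote path from its parts. All values below are made-up placeholders and remote_key is a hypothetical helper, so treat this as a sketch of the naming scheme rather than Xvc's own code.

```shell
# Made-up placeholders for the storage prefix and a cache path under .xvc/.
storage_prefix="xvc"
cache_path="b3/1bc/b82/80fcea/0.bin"

# Hypothetical helper: $1 is the repository identifier.
remote_key() {
  printf '%s/%s/%s\n' "$storage_prefix" "$1" "$cache_path"
}

# The same cache path from two repositories lands in two different places.
key_a=$(remote_key "repo-a")
key_b=$(remote_key "repo-b")
echo "$key_a"
echo "$key_b"
```

Because the repository identifier is part of the path, identical content tracked in two projects is stored twice.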
xvc storage new r2
Purpose
Use Cloudflare R2 as an Xvc storage.
Synopsis
$ xvc storage new r2 --help
Add a new R2 storage
Reads credentials from `R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new r2 [OPTIONS] --name <NAME> --account-id <ACCOUNT_ID> --bucket-name <BUCKET_NAME>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--account-id <ACCOUNT_ID>
R2 account ID
--bucket-name <BUCKET_NAME>
Bucket name
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Before calling any commands that use this storage, you must set the following environment variables.
- R2_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.
- R2_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Cloudflare R2 account. The second form is used when you have multiple accounts and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new r2 --name backup --bucket-name xvc-test --account-id e5dcca29209558eb9de6c07ae53b0a6f --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
xvc storage new wasabi
Purpose
Configure a Wasabi service as an Xvc storage.
Synopsis
$ xvc storage new wasabi --help
Add a new Wasabi storage
Reads credentials from `WASABI_ACCESS_KEY_ID` and `WASABI_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new wasabi [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--endpoint <ENDPOINT>
Endpoint for the server, complete with the region if there is
e.g. for eu-central-1 region, use s3.eu-central-1.wasabisys.com as the endpoint.
[default: s3.wasabisys.com]
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Before calling any commands that use this storage, you must set the following environment variables.
- WASABI_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.
- WASABI_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Wasabi account. The second form is used when you have multiple storage accounts with different access keys.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new wasabi --name backup --bucket-name xvc-test --endpoint s3.wasabisys.com --storage-prefix xvc-storage
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
xvc storage new digital-ocean
Purpose
Configure a Digital Ocean Spaces service as an Xvc storage.
Synopsis
$ xvc storage new digital-ocean --help
Add a new Digital Ocean storage
Reads credentials from `DIGITAL_OCEAN_ACCESS_KEY_ID` and `DIGITAL_OCEAN_SECRET_ACCESS_KEY` environment variables. Alternatively you can use `XVC_STORAGE_ACCESS_KEY_ID_<storage_name>` and `XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>` environment variables if you have multiple storages of this type.
Usage: xvc storage new digital-ocean [OPTIONS] --name <NAME> --bucket-name <BUCKET_NAME> --region <REGION>
Options:
-n, --name <NAME>
Name of the storage
This must be unique among all storages of the project
--bucket-name <BUCKET_NAME>
Bucket name
--region <REGION>
Region of the server
--storage-prefix <STORAGE_PREFIX>
You can set a directory in the bucket with this prefix
[default: ]
-h, --help
Print help (see a summary with '-h')
Examples
Before calling any commands that use this storage, you must set the following environment variables.
- DIGITAL_OCEAN_ACCESS_KEY_ID or XVC_STORAGE_ACCESS_KEY_ID_<storage_name>: The access key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.
- DIGITAL_OCEAN_SECRET_ACCESS_KEY or XVC_STORAGE_SECRET_ACCESS_KEY_<storage_name>: The secret key of the Digital Ocean account. The second form is used when you have multiple Digital Ocean accounts and you want to use a specific one.
The command works only in Xvc repositories.
$ git init
...
$ xvc init
$ xvc-test-helper create-directory-tree --directories 1 --files 3 --seed 20230211
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
Xvc only sends and receives tracked files.
$ xvc file track dir-0001
You can define a storage bucket as storage and begin to use it.
$ xvc storage new digital-ocean --name backup --bucket-name xvc --region fra1 --storage-prefix xvc
Send files to this storage.
$ xvc file send dir-0001 --to backup
You can remove the files you sent from your cache and workspace.
$ xvc file remove --from-cache dir-0001/
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88/0.bin
[DELETE] [CWD]/.xvc/b3/1bc/b82/80fcea6acf2362a4ec4ef8512fe2f791f412fed1635009293abedcad88
[DELETE] [CWD]/.xvc/b3/1bc/b82
[DELETE] [CWD]/.xvc/b3/1bc
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a/0.bin
[DELETE] [CWD]/.xvc/b3/863/86d/62e50462e37699d86e9b436526cb3fe40c66e38030e4e25ae4e168193a
[DELETE] [CWD]/.xvc/b3/863/86d
[DELETE] [CWD]/.xvc/b3/863
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef/0.bin
[DELETE] [CWD]/.xvc/b3/f60/f11/901bf063f1448d095f336929929e153025a3ec238128a42ff6e5f080ef
[DELETE] [CWD]/.xvc/b3/f60/f11
[DELETE] [CWD]/.xvc/b3/f60
[DELETE] [CWD]/.xvc/b3
$ rm -rf dir-0001/
Then get them back from the storage.
$ xvc file bring --from backup dir-0001
$ tree dir-0001
dir-0001
├── file-0001.bin
├── file-0002.bin
└── file-0003.bin
1 directory, 3 files
If you want to remove a file and all of its versions from a storage, you can use the xvc file remove command.
$ xvc file remove --from-storage backup dir-0001/
Utilities
xvc root
Purpose
Shows the Xvc root project directory where .xvc/ resides.
Synopsis
$ xvc root --help
Find the root directory of a project
Usage: xvc root [OPTIONS]
Options:
--absolute Show absolute path instead of relative
-h, --help Print help
Examples
xvc root can be used in scripts to make paths relative to the Xvc project root.
By default, it shows the relative path.
$ xvc root
..
When you supply --absolute, it prints the absolute path.
$ xvc root --absolute
/home/user/my-xvc-project/
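In a script, the result would typically be captured with command substitution, for example root=$(xvc root --absolute). The sketch below shows the same idea without requiring Xvc to be installed: find_xvc_root is a hypothetical stand-in that walks up the directory tree until it finds .xvc/, and it is not how Xvc itself is implemented.

```shell
# Hypothetical stand-in for `xvc root --absolute`:
# walk up from the given directory until a .xvc/ directory is found.
find_xvc_root() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while [ "$dir" != "/" ]; do
    if [ -d "$dir/.xvc" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir=$(dirname "$dir")
  done
  return 1
}

# Demo in a throwaway directory tree.
demo=$(mktemp -d)
mkdir -p "$demo/project/.xvc" "$demo/project/data/raw"

# Resolve the project root from a nested directory and build a path from it.
root=$(find_xvc_root "$demo/project/data/raw")
echo "project root: $root"
echo "raw data dir: $root/data/raw"
# Remove the throwaway tree when done: rm -rf "$demo"
```

Building paths from the project root keeps a script working no matter which subdirectory it is started from.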
xvc check-ignore
Purpose
Check whether a path is ignored or whitelisted by Xvc.
Synopsis
$ xvc check-ignore --help
Check whether files are ignored with `.xvcignore`
Usage: xvc check-ignore [OPTIONS] [TARGETS]...
Arguments:
[TARGETS]...
Targets to check. If no targets are provided, they are read from stdin
Options:
--ignore-filename <IGNORE_FILENAME>
Filename that contains ignore rules
This can be set to .gitignore to test whether Git and Xvc work the same way.
[default: .xvcignore]
-h, --help
Print help (see a summary with '-h')
Examples
$ git init
...
$ xvc init
You can add files and directories to be ignored by Xvc to .xvcignore files.
$ zsh -cl "echo 'my-dir/my-file' >> .xvcignore"
By default, it checks the files supplied from stdin.
$ zsh -cl 'echo my-dir/my-file | xvc check-ignore'
[IGNORE] [CWD]/my-dir/my-file
The .xvcignore file format is identical to the .gitignore file format.
$ cat .xvcignore
# Add patterns of files xvc should ignore, which could improve
# the performance.
# It's in the same format as .gitignore files.
.DS_Store
my-dir/my-file
If you supply paths from the CLI, they are checked against the ignore rules in .xvcignore.
$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[NO MATCH] [CWD]/another-dir/another-file
You can also add whitelist patterns to .xvcignore files.
$ zsh -cl "echo '!another-dir/*' >> .xvcignore"
$ xvc check-ignore my-dir/my-file another-dir/another-file
[IGNORE] [CWD]/my-dir/my-file
[WHITELIST] [CWD]/another-dir/another-file
This utility can be used to check any other ignore rules in other files as well.
You can specify an alternative ignore filename with the --ignore-filename option.
The command below is identical to git check-ignore and should give the same results.
$ xvc check-ignore --ignore-filename .gitignore
xvc aliases
Synopsis
$ xvc aliases --help
Print command aliases to be sourced in shell files
Usage: xvc aliases
Options:
-h, --help Print help
Examples
You can include aliases in interactive shells.
$ . $(xvc aliases)
$ pvc --help
Pipeline management commands
Usage: xvc pipeline [OPTIONS] <COMMAND>
Commands:
new Add a new pipeline
update Rename, change dir or set a pipeline default
delete Delete a pipeline
run Run a pipeline
list List all pipelines
dag Generate mermaid diagram for the pipeline
export Export the pipeline to a YAML, TOML or JSON file
import Import the pipeline from a file
step Step management commands
help Print this message or the help of the given subcommand(s)
Options:
-n, --name <NAME> Name of the pipeline this command applies to
-h, --help Print help information
If you add the above line to your .bashrc or .zshrc, these aliases will always be available.
You can get a list of aliases.
$ xvc aliases
alias xls='xvc file list'
alias pvc='xvc pipeline'
alias fvc='xvc file'
alias xvcf='xvc file'
alias xvcft='xvc file track'
alias xvcfl='xvc file list'
alias xvcfs='xvc file send'
alias xvcfb='xvc file bring'
alias xvcfh='xvc file hash'
alias xvcfco='xvc file checkout'
alias xvcfr='xvc file recheck'
alias xvcp='xvc pipeline'
alias xvcpr='xvc pipeline run'
alias xvcps='xvc pipeline step'
alias xvcpsn='xvc pipeline step new'
alias xvcpsd='xvc pipeline step dependency'
alias xvcpso='xvc pipeline step output'
alias xvcpi='xvc pipeline import'
alias xvcpe='xvc pipeline export'
alias xvcpl='xvc pipeline list'
alias xvcpn='xvc pipeline new'
alias xvcpu='xvc pipeline update'
alias xvcpd='xvc pipeline dag'
alias xvcs='xvc storage'
alias xvcsn='xvc storage new'
alias xvcsl='xvc storage list'
alias xvcsr='xvc storage remove'
If there are aliases that you'd rather not use with Xvc, you can unalias them.
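Since the command prints alias definitions to standard output, one way to load them is to evaluate that output in the current shell and then drop the ones you don't want. The sketch below uses print_aliases, a hypothetical stand-in that prints two of the definitions listed above, so it runs without Xvc; replacing it with xvc aliases is an assumption you should verify in your own shell.

```shell
# Hypothetical stand-in for `xvc aliases`, printing two of the definitions.
print_aliases() {
  printf '%s\n' "alias pvc='xvc pipeline'" "alias xls='xvc file list'"
}

# Evaluate the printed definitions in the current shell.
eval "$(print_aliases)"

# Show what pvc now stands for.
alias pvc

# Drop an alias you'd rather not have.
unalias xls
```

After this, pvc stays defined while xls is gone.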
This command is not implemented yet. Please see https://github.com/iesahin/xvc/issues/176 for its progress.
Rust API
xvc
See https://docs.rs/xvc/ for the latest version of the Xvc API
xvc-config
See https://docs.rs/xvc-config/ for the latest version of the Xvc API
xvc-core
See https://docs.rs/xvc-core/ for the latest version of the Xvc API
xvc-ecs
xvc-file
See https://docs.rs/xvc-file/ for the latest version of the Xvc API
xvc-logging
See https://docs.rs/xvc-logging/ for the latest version of the Xvc API
xvc-pipeline
See https://docs.rs/xvc-pipeline/ for the latest version of the Xvc API
xvc-storage
See https://docs.rs/xvc-storage/ for the latest version of the Xvc API
xvc-walker
See https://docs.rs/xvc-walker/ for the latest version of the Xvc API
Xvc Architecture
The malleability of the material (bits and bytes) we're working with leads to difficulties in architecting software. Unlike the materials of real architecture, bits and bytes don't bring natural restrictions. It's not possible to build skyscrapers with mud bricks, but our material is far more malleable. There are so many options and so many ways to solve problems that it's easy to sink into technical mud with the decisions we make.
Software developers created a set of architectural principles to overcome this lack of limits. Most of these principles are bogus. They are not tested in the field. We seldom have software that's still perfectly maintainable after ten years. Usually, reading and understanding the code is more difficult than coming up with a new solution and rewriting it.
In this chapter, we describe the problems, assumptions, and solutions in Xvc's intended domain. It's a work in progress but should give you ideas about the intentions behind decisions.
After two decades, I (un)learned a few basic principles regarding software development.
- Object Oriented Programming doesn't work. Mixing data and functions (methods) isn't a good way to write programs. It leads to artificial layers and structures that become burdensome in the long run. It forces the developer to think about both the data and functionality at the same time. This makes reasoning and solving the problem harder than it should be.
- Data structures are more important than algorithms. Using a few distinct, well-thought-out data structures is more important than creating the best algorithm. Algorithms are replaceable locally without much peripheral impact. Modifying data structures usually requires updates to all related elements.
- DRY is overrated. It may be a good principle after you write the first version. However, during the actual development phase, it's not a good idea to try not to repeat yourself. What parts of the program repeat, what parts rhyme, and what should be abstracted can be seen after we write the whole. Trying to apply abstract principles to exploratory development hinders the ability to solve problems as plainly as possible.
- More errors are made in the name of abstraction than the reverse. Abstractions don't always help. They usually distribute a single functionality across arbitrary layers. In the age of LSP, it's easier to find repeating functionality and merge/rewrite, rather than fixing incorrect assumptions about abstractions. Problems with repeating code are obvious and easier to fix than problems with abstractions.
- Vertical architecture is more important than horizontal architecture. Vertical architecture means the lower the number of layers between the user and their intention, the better. If the user wants to copy a file, creating a layer of abstract classes to make this more modular doesn't result in more resilient software. If we want to detect whether we're in a Git repository, checking the presence of the .git directory is simpler than creating a few abstract classes that work for more than one SCM, and implementing abstract methods for them. The architecture shouldn't try to satisfy abstract patterns; it should make the path between the user's action and effect as direct as possible.
Xvc Modules (Crates)
Xvc is composed of modules that can be tested and used independently.
core
module is in the middle of the architecture.
Lower-level crates interface with the OS and convert these to data structures.
Higher levels use these data structures to implement functionality.
For example xvc-walker
crate interfaces with the directories and paths, ignore rules and serves a set of paths with their metadata.
xvc-file
crate uses these to check whether a file is changed or not.
- logging: Logger definitions and debugging macros.
- walker: A file system directory walker that checks ignore files. It can also notify the changes in the directory via channels after the initial traversal.
- config: Configuration framework that loads configuration from various levels (Default, System, User, Project, Environment) and merges these with command line options for each module.
- ecs: The entity-component system responsible for saving and loading state of all data structures, along with their associations and queries.
- storage: Commands and functionality to configure external (local or cloud) locations to store file content.
- core: Xvc specific data structures and utilities. All user level modules use this module for shared functionality.
- file: Commands to track files and utilities around file management.
- pipeline: Commands to define data pipelines as DAGs and run them.
The current dependency graph where lower-level modules are used directly is this:
graph TD
    xvc --> xvc-file
    xvc --> xvc-pipeline
    xvc-file --> xvc-config
    xvc-file --> xvc-core
    xvc-file --> xvc-ecs
    xvc-file --> xvc-logging
    xvc-file --> xvc-walker
    xvc-file --> xvc-storage
    xvc-pipeline --> xvc-config
    xvc-pipeline --> xvc-core
    xvc-pipeline --> xvc-ecs
    xvc-pipeline --> xvc-logging
    xvc-pipeline --> xvc-walker
    xvc-config --> xvc-walker
    xvc-config --> xvc-logging
    xvc-ecs --> xvc-logging
    xvc-core --> xvc-config
    xvc-core --> xvc-logging
    xvc-core --> xvc-walker
    xvc-core --> xvc-ecs
    xvc-walker --> xvc-logging
After the crate interfaces are stabilized, all lower-level functions will be reused from xvc-core
.
It will provide the basic Xvc API.
In this case, the graph will be simplified.
graph TD
    xvc --> xvc-file
    xvc --> xvc-pipeline
    xvc-file --> xvc-core
    xvc-pipeline --> xvc-core
    xvc-config --> xvc-walker
    xvc-config --> xvc-logging
    xvc-ecs --> xvc-logging
    xvc-core --> xvc-config
    xvc-core --> xvc-logging
    xvc-core --> xvc-walker
    xvc-core --> xvc-ecs
    xvc-core --> xvc-storage
    xvc-walker --> xvc-logging
Any improvement in the user-level API will be made at levels above xvc-core.
Any improvement in lower-level modules will be made in the dependencies of xvc-core.
Goals
Xvc is a CLI MLOps tool to track file, data, pipeline, experiment, and model versions.
It has the following goals:
- Track any kind of file, including large binary, data and model files, in Git.
- Get a subset of these files.
- Remove files from the workspace temporarily, and retrieve them from the cache.
- Upload and download these files to/from a central server.
- Let users run pipelines composed of commands.
- Invalidate pipelines partially.
- Run a pipeline or arbitrary commands as experiments, and store and retrieve them.
Xvc users are data and machine learning professionals that need to track large amounts of data. They also want to run arbitrary commands on this data when it changes. Their goal is to produce better machine learning models and better suited data for their problems.
We have three quality goals:
- Robustness: The system should be robust for basic operations.
- Performance: The overall system performance must be within the ballpark of usual commands like b3sum or cp.
- Availability: The system must run on all major operating systems.
Xvc users work with large amounts of data. They want to depend on Xvc for basic operations like tracking file versions, and uploading these to a central location.
They don't want to wait too long for these operations on common hardware.
They would like to download their data to any system running various operating systems.
Xvc Cache
The cache is where Xvc copies the files it tracks.
It's located under the .xvc
directory.
Instead of the file tree that's normally used to address files, it uses the content digest of files to organize them.
In a standard file hierarchy, we have files in paths like /home/iesahin/Photos/my-photo.png
.
Xvc doesn't use such a tree in its cache.
It uses paths like .xvc/b3/a12/b45/d789a...f54/0.png
to refer to files.
Producing the cache path from its content causes cache paths to change when the files are updated.
For example, in a standard file system, if you save another photo on top of my-photo.png
, the first version will be
lost.
Xvc stores these two versions in different locations in the cache, so they are not lost.
There are 4 parts of this cache path.
.xvc
part is the standard directory xvc init
command creates. It resides in the root folder of your project.
b3/
denotes the digest type of the content digest.
Xvc supports more than one algorithm to calculate content digests.
The HashAlgorithm enum (https://docs.rs/xvc-core/latest/xvc_core/types/hashalgorithm/enum.HashAlgorithm.html) shows which algorithms are supported.
Each of these algorithms has a 2-letter prefix.
- b3: BLAKE3
- b2: BLAKE2s
- s3: SHA2-256
- s2: SHA3-256
Note that all these digest algorithms produce 256-bit (32-byte) digests. This digest is converted to 64 hexadecimal digits. To keep the total path length shorter, Xvc requires digests to be 32 bytes in length.
The third part in the cache path is these 64 hexadecimal digits in the form a12/b45/d789...f54/
.
The 64 digits are split into nested directories to keep the number of entries in any single directory low.
Had Xvc put all cache elements in a single directory, it could lead to degraded performance in some file systems.
With this arrangement, b3/
can contain at most 4096 directories, each of which contains at most 4096 directories.
With usual distribution and good hash algorithms, there won't be more than 4000 elements per directory until 64 billion
files are in the cache. (4000³)
The fourth part is the 0.png
part, that's the file itself with the same extension but with 0
as the basename.
Xvc uses digest as a directory instead of the file name.
There may be times when the file in the cache should be used manually, on cloud storage for example.
The extension is kept for this reason, to make sure that the OS recognizes the file type correctly.
The rename to 0
means that this is the whole file.
In the future, when Xvc supports splitting large files to transfer them to remotes, all parts of the file will be put into this directory.
Storages also use the same cache structure, with an added GUID
part to use single storage for multiple projects.
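The four parts of the cache path can be put together in a few lines. The following is a minimal sketch, not Xvc's actual implementation: the function name `cache_path` and its parameters are hypothetical, and the 3/3/58 split of the hexadecimal digits is read off the example path above.

```rust
use std::path::PathBuf;

/// Hypothetical helper: builds a path like `.xvc/b3/a12/b45/d789...f54/0.png`.
/// `prefix` is the two-letter algorithm prefix, `digest` the 32-byte content digest.
fn cache_path(prefix: &str, digest: &[u8; 32], extension: &str) -> PathBuf {
    // 32 bytes become 64 lowercase hexadecimal digits.
    let hex: String = digest.iter().map(|byte| format!("{:02x}", byte)).collect();

    let mut path = PathBuf::from(".xvc");
    path.push(prefix); // digest algorithm, e.g. "b3"
    path.push(&hex[0..3]); // 16^3 = 4096 possible directories
    path.push(&hex[3..6]); // another 4096 possible directories
    path.push(&hex[6..]); // the remaining 58 digits
    // The file keeps its extension but uses `0` as the basename.
    if extension.is_empty() {
        path.push("0");
    } else {
        path.push(format!("0.{}", extension));
    }
    path
}
```

Because the path depends only on the digest and the extension, two versions of the same workspace file end up in different cache directories.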
The Architecture of Xvc Entity Component System
Xvc uses an entity component system (ECS) in its core. ECS architecture is popular in game development, but hasn't found popularity in other areas. It's an alternative to Object-Oriented Programming.
There are a few basic notions of ECS architecture. Although it may differ in other frameworks, Xvc assumes the following:
- An entity is a neutral way of tracking components and their relationships. It doesn't contain any semantics other than being an entity. An entity in Xvc is an atomic integer tuple (XvcEntity).
- A component is a bundle of associated data about these entities. All semantics of entities are described through components. Xvc uses components to keep track of different aspects of file system objects, dependencies, storages, etc.
- A system is where the components are created and modified. Xvc considers all modules that interact with components as separate systems.
Suppose you want to track a new file in Xvc.
Xvc creates a new entity for this file.
It associates the path (XvcPath) with this entity.
It creates an instance of XvcMetadata that represents the file size and timestamp, and associates it with this entity.
An XvcDigest struct is associated with the entity to show the file's content digest.
The difference from OOP is that there is no basic or main object. There is no file
object that contains a
digest
, or a directory
object that is inherited from files.
If you want to work only with digests and want to find the workspace paths associated with them, you can write a
function (system in Entity-Component-System) that starts from XvcDigest
records and collects the associated paths.
If you want to get only the files larger than a certain size, you can work with XvcMetadata
, filter them and get the paths later.
In contrast, in an OOP setting, these data are associated with paths and when you want to do such operations, you need to load paths and their associations first.
The OOP way of doing things usually works against the principle of locality.
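A query that starts from metadata and only later looks up paths can be sketched as follows. This is an illustration under assumptions: `PathComponent`, `MetadataComponent` and `paths_larger_than` are hypothetical stand-ins for XvcPath, XvcMetadata and a system function, and plain `BTreeMap`s stand in for stores.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for XvcEntity: a tuple of two integers.
type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
struct PathComponent(String);

#[derive(Debug, Clone, PartialEq)]
struct MetadataComponent {
    size: u64,
}

/// A "system": filters on metadata first and only then touches the paths.
fn paths_larger_than(
    paths: &BTreeMap<Entity, PathComponent>,
    metadata: &BTreeMap<Entity, MetadataComponent>,
    min_size: u64,
) -> Vec<String> {
    metadata
        .iter()
        .filter(|(_, md)| md.size > min_size)
        // The shared entity is the only link between the two components.
        .filter_map(|(entity, _)| paths.get(entity))
        .map(|path| path.0.clone())
        .collect()
}
```

Neither component refers to the other. The entity key is the only thing they share, which is what keeps the path and metadata maps independent.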
The whole idea is to be flexible for further changes.
For example, these days Xvc doesn't have notions of data and models. Files are just files.
It doesn't have different functionality for files that are models or data.
When this distinction is added, an XvcModel component will be created and associated with the same entity as an XvcPath, and a set of XvcFeatures will be associated in the same way XvcMetadata is associated with XvcPath.
It will allow working with some paths as model files but it won't require paths to be known beforehand.
There may be other metadata, like features or version associated with models that are more important.
There may be some models without a file system path, maybe living only in memory or in the cloud.
In contrast, OOP would define this either by inheritance (a model is a path) or containment (a model has a path). When you select any of these, it becomes a relationship that must be maintained indefinitely. When you only have an integer that identifies these components, it's much easier to describe models without a path later. There is no predefined relationship between paths and models. You can have paths without models, or models without paths.
The architecture is approximately similar to database modeling. Components are in-memory tables, albeit they are small and mostly contain a few fields. Entities are numeric primary keys. Systems are insert, query and update mechanisms.
Stores
An XvcStore
in its basic definition is a map structure between XvcEntity
and a component type T.
It has facilities for persistence, iteration, search and filtering.
It can be considered a system in the usual ECS sense.
Loading and Saving Stores
As our goal is to track data files with Git, stores save and load binary files' metadata to and from text files. Instead of storing the binary data itself in Git, Xvc stores information about these files to track whether they have changed. By default, this metadata is persisted to JSON, so component types must be serializable. As they are almost always composed of basic types that serde supports, this doesn't pose a difficulty in usage. The JSON files are then committed to Git.
Note that there are usually multiple branches in Git repositories. Also, multiple users may work on the same branch.
When these text files are reused by the stores, they are modified and this may lead to merge conflicts. We don't want our users to deal with merge conflicts with entities and components in text files. This also makes it possible to use binary formats like MessagePack in the future.
Suppose user A made a change in XvcStore<XvcPath>
by adding a few files.
Another user B made another change to the project, by adding another set of files in another copy of the project.
This will lead to merge conflicts:
- The XvcEntity counter will have different values in A and B's repositories.
- XvcStore<XvcPath> will have different records in A and B's repositories.
Instead of saving to and loading from monolithic files, XvcStore
saves and loads event logs.
There are two kinds of events in a store:
- Add(XvcEntity, T): Adds an element T to a store.
- Remove(XvcEntity): Removes the element with the given entity id.
These events are saved into files. When the store is loaded, all files after the last full snapshot are loaded and replayed.
When you add an item to a store, it saves the Add
event to a log.
These events are then put into a vector.
A BTreeMap
is also created from this vector.
When an item is deleted, a Remove
event is added to the event vector.
While loading, stores remove the elements with Remove
events from the BTreeMap.
So the final set of elements doesn't contain the removed item.
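The replay described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Event` enum and `replay` function; the real XvcStore also handles persistence and the inverted index.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

#[derive(Debug, Clone, PartialEq)]
enum Event<T> {
    Add(Entity, T),
    Remove(Entity),
}

/// Rebuilds the current map by replaying the events in order.
fn replay<T: Clone>(events: &[Event<T>]) -> BTreeMap<Entity, T> {
    let mut map = BTreeMap::new();
    for event in events {
        match event {
            Event::Add(entity, value) => {
                map.insert(*entity, value.clone());
            }
            Event::Remove(entity) => {
                map.remove(entity);
            }
        }
    }
    map
}
```

Because events are only appended, two users who add different files produce different log files rather than conflicting edits to one file.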
The second problem with multiple branches is duplicate entities in separate branches. Xvc uses a counter to generate unique entity ids. When a store is loaded, it checks the last entity id in the event log and uses it as the starting point for the counter. But using this counter as is causes duplicate values in different branches. Xvc solves this by adding a random value to these counter values.
Since v0.5, XvcEntity
is a tuple of 64-bit integers. The first is loaded from
the disk and is an atomic counter. The second is a random value that is renewed
at every command invocation. Therefore we have a unique entity id for every run,
that's also sortable by the first value. Easy sorting with integers is sometimes
required for stable lists.
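A rough sketch of such an entity generator is below. The names `Entity`, `EntityGenerator` and `next_entity` are hypothetical, and the per-run random value is passed in as a parameter because the Rust standard library has no random number generator; how Xvc obtains it is not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for XvcEntity: (counter, per-run random value).
// Deriving Ord sorts by the counter first, then by the random part.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Entity(u64, u64);

struct EntityGenerator {
    counter: AtomicU64,
    run_id: u64,
}

impl EntityGenerator {
    /// `last_counter` would be loaded from disk; `run_id` is the random
    /// value chosen once per command invocation (an assumption here).
    fn new(last_counter: u64, run_id: u64) -> Self {
        Self {
            counter: AtomicU64::new(last_counter + 1),
            run_id,
        }
    }

    fn next_entity(&self) -> Entity {
        Entity(self.counter.fetch_add(1, Ordering::SeqCst), self.run_id)
    }
}
```

Two branches that start from the same counter value still produce distinct entities as long as their random parts differ.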
Inverted Index
Stores also have an inverted index for quick lookup.
It stores each value of T
as a key, mapped to the list of entities that correspond to this value.
For example, when we have a path that we stored, it's a single operation to get the corresponding XvcEntity
and after this, all recorded metadata about this path is available.
All search, iteration and filtering functionality is performed using these two internal maps.
In summary, a store has four components.
- An immutable log of previous events:
Vec<Event<T>>
- A mutable log of current events:
Vec<Event<T>>
- A mutable map of the current data:
BTreeMap<XvcEntity, T>
- A mutable map of the entities from values:
BTreeMap<T, Vec<XvcEntity>>
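The inverted index can be derived from the entity map. The sketch below is an assumption about how such an index could be built; the function name `invert` is hypothetical.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

/// Builds a value -> entities index from an entity -> value map.
fn invert<T: Ord + Clone>(map: &BTreeMap<Entity, T>) -> BTreeMap<T, Vec<Entity>> {
    let mut index: BTreeMap<T, Vec<Entity>> = BTreeMap::new();
    for (entity, value) in map {
        // Several entities may carry the same value, hence the Vec.
        index.entry(value.clone()).or_default().push(*entity);
    }
    index
}
```

A lookup by value then costs one map access instead of a scan over all entities.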
Note that when two branches perform the same operation, the event logs will be different, as the random part of XvcEntity is different. When two branches merge, the inverted index may contain conflicting values. In this case, a fsck command is used to merge the store files and merge conflicting entity ids.
Insert, update and delete operations affect the mutable log and the maps.
Queries, iteration and other non-destructive operations are done with the maps.
When loading, all log files are merged into the immutable log.
No standard operation touches the event logs.
All log modifications are done outside of the normal workflow.
When saving, only the mutable log is saved.
Note that events can only be added to the log; they are not removed.
(See xvc fsck --merge-stores
for merging store files.)
Relationship Stores
XvcStore
keeps component-per-entity.
Each component is a flat structure that doesn't refer to other components.
Xvc also has relation stores that represent relationships between entities and components. Similar to the database Entity-Relationship model, there are three kinds of relationship stores:
R11Store<T, U>
keeps two sets of components associated with the same entity.
It represents a 1-1 relationship between T
and U
.
It contains two XvcStore
s for each component type.
These two stores are indexed with the same XvcEntity
values.
For example, an R11Store<XvcPath, XvcMetadata>
keeps track of path metadata for the identical XvcEntity
keys.
R1NStore<T, U>
keeps parent-child relationships.
It represents a 1-N relationship between T
and U
.
On top of two XvcStore
s, this one keeps track of relationships with a third XvcStore<XvcEntity>
.
It lists which U
's are children of T
s.
For example, a value of XvcPipeline
can have multiple XvcStep
s.
These are represented with R1NStore<XvcPipeline, XvcStep>
.
This struct has parent-to-child
and child-to-parent
functions that can be used to get the children of a parent, or the parent of a child element.
The third type is RMNStore<T, U>
.
This one keeps an arbitrary number of relationships between T
and U
.
Any number of T
s may correspond to any number of U
s.
This type of store keeps the relationships in two XvcStore<XvcEntity>
's.
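As an illustration of the 1-N case, the sketch below keeps parents, children and a child-to-parent map. It is a simplified assumption of how such a store could look: the type `R1N` and its method names are hypothetical, and plain `BTreeMap`s replace XvcStore.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for XvcEntity.
type Entity = (u64, u64);

/// Hypothetical 1-N store: parents of type `T`, children of type `U`.
struct R1N<T, U> {
    parents: BTreeMap<Entity, T>,
    children: BTreeMap<Entity, U>,
    // The third map: child entity -> parent entity.
    child_to_parent: BTreeMap<Entity, Entity>,
}

impl<T, U> R1N<T, U> {
    fn new() -> Self {
        Self {
            parents: BTreeMap::new(),
            children: BTreeMap::new(),
            child_to_parent: BTreeMap::new(),
        }
    }

    fn insert_parent(&mut self, entity: Entity, value: T) {
        self.parents.insert(entity, value);
    }

    fn insert_child(&mut self, parent: Entity, child: Entity, value: U) {
        self.children.insert(child, value);
        self.child_to_parent.insert(child, parent);
    }

    /// All children whose recorded parent is `parent`.
    fn children_of(&self, parent: Entity) -> Vec<&U> {
        self.child_to_parent
            .iter()
            .filter(|(_, p)| **p == parent)
            .filter_map(|(child, _)| self.children.get(child))
            .collect()
    }

    /// The parent of `child`, if one is recorded.
    fn parent_of(&self, child: Entity) -> Option<&T> {
        let parent = self.child_to_parent.get(&child)?;
        self.parents.get(parent)
    }
}
```

With a pipeline as the parent and its steps as children, `children_of` returns the steps of a pipeline and `parent_of` returns the pipeline a step belongs to.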
Xvc Pipelines State Machine
Xvc pipelines use a state machine to track the progress of each step. Each step has a state that is updated as the pipeline is executed.
stateDiagram-v2
    [*] --> Begin
    Begin --> DoneWithoutRunning: RunNever
    Begin --> WaitingDependencySteps: RunConditional
    WaitingDependencySteps --> WaitingDependencySteps: DependencyStepsRunning
    WaitingDependencySteps --> CheckingMissingDependencies: DependencyStepsFinishedSuccessfully
    WaitingDependencySteps --> Broken: DependencyStepsFinishedBroken
    WaitingDependencySteps --> CheckingMissingDependencies: DependencyStepsFinishedBrokenIgnored
    CheckingMissingDependencies --> CheckingMissingDependencies: MissingDependenciesIgnored
    CheckingMissingDependencies --> Broken: HasMissingDependencies
    CheckingMissingDependencies --> CheckingMissingOutputs: NoMissingDependencies
    CheckingMissingOutputs --> CheckingMissingOutputs: MissingOutputsIgnored
    CheckingMissingOutputs --> CheckingTimestamps: NoMissingOutputs
    CheckingMissingOutputs --> WaitingToRun: HasMissingOutputs
    CheckingTimestamps --> CheckingTimestamps: TimestampsIgnored
    CheckingTimestamps --> CheckingDependencyContentDigest: HasNoNewerDependencies
    CheckingTimestamps --> WaitingToRun: HasNewerDependencies
    CheckingDependencyContentDigest --> WaitingToRun: ContentDigestIgnored
    CheckingDependencyContentDigest --> DoneWithoutRunning: ContentDigestNotChanged
    CheckingDependencyContentDigest --> WaitingToRun: ContentDigestChanged
    DoneWithoutRunning --> Done: CompletedWithoutRunningStep
    WaitingToRun --> WaitingToRun: ProcessPoolFull
    WaitingToRun --> Running: StartProcess
    WaitingToRun --> Broken: CannotStartProcess
    Running --> Running: WaitProcess
    Running --> Broken: ProcessTimeout
    Running --> Done: ProcessCompletedSuccessfully
    Running --> Broken: ProcessReturnedNonZero
    Broken --> Broken: HasBroken
    Done --> Done: HasDone
    Done --> [*]
    Broken --> [*]
A step starts in the Begin
state.
It must wait for all its dependency steps if --when
is set to by_dependencies
(the default) in xvc pipeline step new
or xvc pipeline step update
.
If this option is set to never
, the step will never run and will move to the DoneWithoutRunning
state just after begin.
If this option is set to always
, the step will run regardless of the changes in the dependencies and will move to the
WaitingDependencySteps
state even if dependencies are missing, broken, or have not changed.
If --when
option is set to by_dependencies
, the steps check the following conditions before running:
- All dependency steps must be in the Done state.
- There should be no missing dependency files.
- There should be no broken dependency processes.
- Dependency files should be newer, or the content digest should be different from the step outputs.
If any of these conditions are met, the step will move to the WaitingDependencySteps
state.
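Part of the state machine can be written as a transition function. The sketch below covers only a subset of the states and events in the diagram (the start of a step and the process-related transitions); the enum layout and the function name `transition` are assumptions, not the actual xvc-pipeline code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Begin,
    WaitingDependencySteps,
    DoneWithoutRunning,
    WaitingToRun,
    Running,
    Done,
    Broken,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    RunNever,
    RunConditional,
    CompletedWithoutRunningStep,
    ProcessPoolFull,
    StartProcess,
    CannotStartProcess,
    WaitProcess,
    ProcessTimeout,
    ProcessCompletedSuccessfully,
    ProcessReturnedNonZero,
}

/// Returns the next state, or `None` when this sketch doesn't cover the pair.
fn transition(state: State, event: Event) -> Option<State> {
    match (state, event) {
        (State::Begin, Event::RunNever) => Some(State::DoneWithoutRunning),
        (State::Begin, Event::RunConditional) => Some(State::WaitingDependencySteps),
        (State::DoneWithoutRunning, Event::CompletedWithoutRunningStep) => Some(State::Done),
        (State::WaitingToRun, Event::ProcessPoolFull) => Some(State::WaitingToRun),
        (State::WaitingToRun, Event::StartProcess) => Some(State::Running),
        (State::WaitingToRun, Event::CannotStartProcess) => Some(State::Broken),
        (State::Running, Event::WaitProcess) => Some(State::Running),
        (State::Running, Event::ProcessCompletedSuccessfully) => Some(State::Done),
        (State::Running, Event::ProcessTimeout)
        | (State::Running, Event::ProcessReturnedNonZero) => Some(State::Broken),
        _ => None,
    }
}
```

Returning `None` for unlisted pairs makes it explicit that an event is not valid in a given state instead of silently keeping the old state.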
Comparisons
To avoid unnecessary work, we need to find differences across versions.
What has changed between the previous version and this version of type T
?
Xvc is built bottom up, with vertical, long functions that do one thing.
For example, xvc file track
is written separately from xvc file recheck
, and the commonalities have arisen after these implementations.
I didn't start from traits and try to fit everything to a model. Instead, I began from concrete enums and structs, then saw that some of these share common functionality, and grouped that functionality into a trait after implementing and refactoring the concrete functions.
I saw the diff
pattern across all comparison functions.
In xvc pipeline
, dependencies need to detect changes to decide whether to invalidate them.
In xvc file
, files and directories need to detect changes to decide whether they should be carried into the cache.
It's easy to make comparison/subtraction when the data types are numeric.
For a signed integer, you can get a single numeric value as diff with diff = a - b
.
For complex data structures, representing the change is not straightforward.
We keep track of everything in the repository in stores.
These serialize a type T
to a file, and get it back when needed.
Diff pattern works with these types.
Sometimes, there happens to be no record of something we have in the repository.
Sometimes, we have only the record, and not the actual thing on disk.
The diff should also handle this.
Instead of trying to come up with wizardry, we decided to represent this with five conditions.
- Identical: When two things of the same type T are equal. Nothing has changed between the actual version and its record.
- RecordMissing { actual: T }: If we have something in the workspace, but can't find the respective record. For example, a new file is added to the workspace, and xvc file track detects it for the first time.
- ActualMissing { record: T }: We found a record in the store, but the corresponding file in the workspace is not where it should be. For example, a tracked file is deleted by the user, but the record is still there.
- Difference { record: T, actual: T }: There is a record, but the actual file in the workspace isn't identical to it. When a tracked file is changed, and its content now returns a different value, this can be reflected with Difference.
- Skipped: When the comparison seems unnecessary or irrelevant. For example, if we know a file hasn't changed by checking its metadata, we don't calculate its content digest and set the diff to Skipped.
These five conditions are represented in Diff
type.
As an entity may have more than one component, a comparison may require multiple Diff
s.
For example, we may want to compare an XvcPath
, to see whether it has changed.
This requires comparing its XvcMetadata
, its ContentDigest
if it's a file, its CollectionDigest
if it's a
directory, etc.
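The five conditions map naturally to a Rust enum. The sketch below follows the variant names given above; the `diff` function and its treatment of the case where neither a record nor an actual value exists (mapped to `Skipped`) are assumptions made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Diff<T> {
    Identical,
    RecordMissing { actual: T },
    ActualMissing { record: T },
    Difference { record: T, actual: T },
    Skipped,
}

/// Compares an optional stored record with an optional actual value.
fn diff<T: PartialEq>(record: Option<T>, actual: Option<T>) -> Diff<T> {
    match (record, actual) {
        // Assumption: with nothing on either side there is nothing to compare.
        (None, None) => Diff::Skipped,
        (None, Some(actual)) => Diff::RecordMissing { actual },
        (Some(record), None) => Diff::ActualMissing { record },
        (Some(record), Some(actual)) => {
            if record == actual {
                Diff::Identical
            } else {
                Diff::Difference { record, actual }
            }
        }
    }
}
```

Since the function is generic over `T`, the same comparison could be reused for metadata, content digests and collection digests of one entity.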
Storages
Xvc uses storages to store content of the files. These storages are different from Git remotes. They don't contain Git history of a repository, but they can store contents of the files tracked by Xvc.
A storage uses the same content-addresses used in Xvc cache to store the files.
For example, if there is a file in Xvc repository that points to /b3/1886572424...defa/0.png
in local cache, this path will be used to identify the content in storage as well.
Additionally, Xvc stores storage event logs that list which operations are performed on that storage. By using these event logs, it's possible to identify what has gone on with storages without checking the file lists. These event logs are also shared with the other users, and a user can identify which files are present in a storage even without a connection.
Basic Operations
All storages should support the following operations:
- Init to initialize a storage.
- List to list the files available in the storage.
- Send to upload files from local cache to a storage.
- Receive to download files from a storage to local cache.
- Delete to delete files from a storage.
All these operations record a distinct event to the event log.
Events record the event, guid of the storage and the event content.
Event contents are like the following:
- Init creates the necessary directories and the guid file in a storage
- List includes the listing retrieved from the storage. Once a list is retrieved from the storage, it's available for local operations. The most recent list is the starting point to determine the files available in a storage.
- Send event contains the affected paths. These paths are added to storage file list.
- Receive event contains the affected paths. These paths are added to storage file list.
- Delete to delete multiple files at once. These paths are removed from storage file list.
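The way these events build up a storage file list can be sketched as below. The enum `StorageEvent`, the function `storage_file_list` and the choice to treat Init as a no-op are assumptions for illustration; the real events also carry the storage guid.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone)]
enum StorageEvent {
    Init,
    List(Vec<String>),
    Send(Vec<String>),
    Receive(Vec<String>),
    Delete(Vec<String>),
}

/// Derives the set of files known to be in a storage from its event log.
fn storage_file_list(events: &[StorageEvent]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for event in events {
        match event {
            // Assumption: initialization doesn't change the file list.
            StorageEvent::Init => {}
            // The most recent listing becomes the new starting point.
            StorageEvent::List(listing) => {
                files = listing.iter().cloned().collect();
            }
            // Sent and received paths are added to the list.
            StorageEvent::Send(paths) | StorageEvent::Receive(paths) => {
                files.extend(paths.iter().cloned());
            }
            // Deleted paths are removed from the list.
            StorageEvent::Delete(paths) => {
                for path in paths {
                    files.remove(path);
                }
            }
        }
    }
    files
}
```

Replaying the log this way needs no connection to the storage, which matches the offline use described above.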
Storage types
Local Storages
A local storage is a directory in the local file system. It may be a mount point shared with others, or another disk that you use for backups and sharing.
- Init uses std::fs::copy to copy the GUID file to the appropriate directory.
- List uses std::fs::read_dir.
- Send uses std::fs::copy with rayon.
- Receive uses std::fs::copy with rayon.
- Delete uses std::fs::remove_file with rayon.
Generic Storages
These storages define commands for each of the operations listed above.
This allows running external programs such as rsync, rclone, or s5cmd.
For such storages, commands for the above operations must be defined and they will be run in separate processes.
This storage type offloads the responsibility of exact operations to the user.
The user is expected to supply values for the following variables:
- {URL}: The url for the storage. This can be anything the commands to send/receive/list will accept. It helps to build the paths with less repetition.
- {STORAGE_DIR}: You can specify the storage directory separately.
- {PATH}: This is set by Xvc for each individual command. It's a path relative to the local cache directory.
- {PROCESS_POOL_SIZE}: This value is used to set the number of processes to perform operations. Setting this to 1 makes all operations sequential.
- List Command: A command to list the {URL}. For example: rsync --list-only {URL}{STORAGE_DIR}
- Send Command: A command to send a file to {URL}{STORAGE_DIR}. It can use {URL} and should use {PATH} in the command. An example may be rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}
- Receive Command: A command to receive a file from a storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: rsync -a {URL}{STORAGE_DIR}{PATH} {PATH}
- Delete Command: A command to delete a file from the storage. It can use {URL} and {STORAGE_DIR}, and should use {PATH} in the command. Example: ssh {URL} "rm {STORAGE_DIR}{PATH}"
Generic storages use these commands to create multiple processes to send/receive/delete files. It's not as fast as using other types because of the overhead involved, but its flexibility is useful.
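Filling in a command template amounts to replacing the placeholders. The following is a small sketch under assumptions: the function name `render_command` is hypothetical, and the URL and directory used in the usage note are made-up example values.

```rust
/// Replaces `{URL}`, `{STORAGE_DIR}` and `{PATH}` in a command template.
/// Note: replacements run in sequence, so a value that itself contains
/// a later placeholder would be substituted again.
fn render_command(template: &str, url: &str, storage_dir: &str, path: &str) -> String {
    template
        .replace("{URL}", url)
        .replace("{STORAGE_DIR}", storage_dir)
        .replace("{PATH}", path)
}
```

For instance, rendering the send template `rsync -a {PATH} {URL}{STORAGE_DIR}{PATH}` with the hypothetical values `user@example.com:`, `/srv/xvc/` and `b3/abc/0.png` gives `rsync -a b3/abc/0.png user@example.com:/srv/xvc/b3/abc/0.png`. The resulting string would then be run in a separate process for each file.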
Git and Xvc
Xvc aims to fill the gap Git leaves for certain workflows. These workflows involve large binary data that shouldn't be replicated in each repository.
Xvc tracks all its metadata on top of Git. In most cases, Xvc assumes the presence of a Git repository where the user tracks the history, text files, and metadata. However, the relationship between these should be clear and separate.
Xvc doesn't (and shouldn't) use Git more than a user could use manually. Our aim is not to replace Git operations with Xvc operations or tamper with the internal structure of the Git repository. When Xvc uses Git to track ECS or other metadata, the operations must be separate and sandwich Xvc operations.
- Any Git operation that involves checking out commits, branches, tags, or other references must come before any Xvc operation. As Xvc relies on the files tracked by Git, resuming any state for Xvc operations should be complete before these operations start.
- Xvc helps to stage and commit certain files in .xvc/ to Git. By default, any state-changing operation in Xvc adds a commit to Git.
- Xvc also helps to store this changed metadata in a new or existing branch. In this case, a checkout must be done before Xvc records the files.
sequenceDiagram
    User ->> Xvc: xvc --from-ref my-branch --to-branch another-branch file track large-dir/
    Xvc ->> Git: git checkout my-branch
    Git ->> Xvc: branch = my-branch
    Xvc ->> xvc-file: track large-dir/
    xvc-file ->> Xvc: Ok. Saved the stores and metadata.
    Xvc ->> Git: Do we have user staged files?
    Git ->> Xvc: Yes. This and this.
    Xvc ->> Git: Stash them.
    Git ->> Xvc: Stashed user staged files.
    Xvc ->> Git: git checkout -b another-branch
    Git ->> Xvc: branch = another-branch
    Xvc ->> Git: git add .xvc/
    Git ->> Xvc: added .xvc/
    Xvc ->> Git: git commit -m "Commit after xvc file track"
    Xvc ->> Git: Unstash files that we have stashed
Note that if the user has some already staged files, these are stashed and unstashed to the requested branch.
This is a side effect of doing xvc commit operations on behalf of the user.
The other option is to report an error and quit if the user has the --to-branch
option set.
The behavior may change in the future.
For the time being, we will keep this stash-unstash operation for the user files.
One other issue is the library that we're going to use. I checked several options when I was writing auto-commit functionality.
At that time, I decided that the number of Git operations for each Xvc operation is less than five.
These can be done by creating a Git process.
The libraries are not 100% identical in features.
Even the most widely used one, libgit2, doesn't provide shallow clones, and it's not possible to use git stash --staged
.
The second reason for this is explainability. Instead of trying to explain to the user what we are doing with Git, we can report the commands we are running. The library interfaces are different from Git CLI. They need to be learned before reading the code. Using Git CLI is more dependable, observable, and understandable than trying to come up with a set of library calls.
Concepts
- Digest: A digest is a 32-byte numeric sequence to identify a file, content or any other data. Xvc uses different algorithms to generate this sequence.
- Associated Digest: This is a specific kind of digest associated with an entity. An entity can have more than one digest, like content digest or metadata digest. Xvc uses these different kinds of digests to avoid unnecessary digest calculations.
- Recheck: Recheck is the process of linking a file to its copy in Xvc cache. Xvc uses different methods to recheck a file, like copy, symlink, hardlink or reflink.
- Workspace: A project is broadly divided into 3 different types of directories. .xvc/ contains the cache and metadata of the tracked files and pipelines, .git/ contains the Git repository, and the workspace contains the files that are tracked by either Xvc or Git. It's the place where you do your work.
- Carry-In: Carry-in is the process of adding a new version of a file to Xvc cache. It's analogous to git commit.
Digest
A numerical summary of an entity. In Xvc, digests are 32 bytes long and are produced by BLAKE3 by default.
See Associated Digest for different types of digests.
Associated Digest
There may be multiple digests associated with an entity like path, directory or dependency. An associated digest is all digests associated with an entity.
Metadata Digest
Files and directories have metadata.
Metadata shows information about creation, modification, access time of the file, or the size of it.
Metadata is OS dependent in most cases.
Xvc abstracts file and directory metadata with XvcMetadata
struct.
Metadata digest represents this abstraction in 32 bytes to compare changes in files and directories.
Content Digest
The content digest of a file is calculated from the data it contains. Xvc calculates 32 bytes from the content. When the content changes, the result of this calculation also changes.
Collection Digest
Some entities in Xvc are composed of multiple elements. Examples are directories (composed of files), file lines, regex filter results, SQL query results etc. Instead of trying to compare all elements, Xvc creates a 32-byte digest of the collection with the same conditions. For example, when a new file is added to a directory, its collection digest also changes. This is used to keep track of changed directories more easily than moving members around.
Development
Code and Documentation Conventions
- Xvc is spelled capitalized in documentation. It's Xvc, not XVC, not xvc.