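Pachyderm 1.1 was released in July. Pachyderm is a containerized data lake that lets you store and analyze data using containers.
This release includes numerous improvements; the detailed list follows:
Features: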
Data Provenance, which tracks the flow of data as it’s analyzed
FlushCommit, which tracks commits forward to the downstream results computed from them
DeleteAll, which restores the cluster to factory settings
More featureful data partitioning (map, reduce and global methods)
Explicit incrementality
Better support for dynamic membership (nodes leaving and entering the cluster)
Commit IDs are now present as env vars for jobs
Deletes and reads now work during job execution
pachctl inspect-* now returns much more information about the inspected objects
PipelineInfos now contain a count of job outcomes for the pipeline
Fixes to pachyderm and bazil.org/fuse to support writing a larger number of files
Jobs now report their end times as well as their start times
Jobs have a pulling state for when the container is being pulled
Put-file now accepts a -f flag for easier puts
Cluster restarts now work, even if Kubernetes is restarted as well
Support for json and binary delimiters in data chunking
Manifests now reference a specific Pachyderm container version, making deployment more bulletproof
Readiness checks for pachd, which make deployment more bulletproof
Kubernetes jobs are now created in the same namespace pachd is deployed in
Support for pipeline DAGs that aren’t transitive reductions.
Appending to files now works in jobs; from shell scripts you can use >>
Network traffic is reduced with object stores by taking advantage of content addressability
Transforms now have a Debug field which turns on debug logging for the job
Pachctl can now be installed via Homebrew on macOS or apt on Ubuntu
ListJob now orders jobs by creation time
OpenShift Origin is now supported as a deployment platform
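The content-addressability item above can be illustrated with a minimal sketch. This is not Pachyderm's actual implementation: the ObjectStore class, the put_block helper, and the choice of SHA-256 are all assumptions made for illustration. The idea is that a block stored under the hash of its own bytes never needs to be sent twice, because the key alone tells the client whether the store already holds that content.

```python
import hashlib


class ObjectStore:
    """Hypothetical in-memory stand-in for a remote object store."""

    def __init__(self):
        self.objects = {}  # key (content hash) -> bytes
        self.uploads = 0   # number of payloads actually transferred

    def has(self, key):
        # A cheap existence check; no payload crosses the network.
        return key in self.objects

    def put(self, key, data):
        # The expensive step: the payload is transferred and stored.
        self.objects[key] = data
        self.uploads += 1


def put_block(store, data):
    """Store a block under its SHA-256 digest, skipping the upload if present."""
    key = hashlib.sha256(data).hexdigest()
    if not store.has(key):
        store.put(key, data)
    return key


store = ObjectStore()
first = put_block(store, b"hello pachyderm")
second = put_block(store, b"hello pachyderm")  # identical content, no second upload
third = put_block(store, b"different block")
```

In this sketch the second call returns the same key as the first without transferring the payload again, which is where the traffic saving would come from.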
Content:
Webscraper example
Neural net example with TensorFlow
Wordcount example
Bug fixes:
False positive on running pipelines
Makefile bulletproofing to make sure things are installed when they’re needed
Races within the FUSE driver
In 1.0 it was possible to get duplicate job IDs; that should be fixed now
Pipelines could get stuck in the pulling state after being recreated several times
Map jobs no longer return when sharded unless the files are actually empty
The FUSE driver could encounter a bounds error during execution; it no longer does
Pipelines no longer get stuck in restarting state when the cluster is restarted
Failed jobs were being marked failed too early, resulting in a race condition
Jobs could get stuck in running when they had failed
Pachd could panic due to membership changes
Starting a commit with a nonexistent parent now errors instead of silently failing
Previously pachd nodes would crash when deleting a watched repo
Jobs now get recreated if you delete and recreate a pipeline
Getting files from nonexistent commits gives a nicer error message
RunPipeline would fail to create a new job if the pipeline had already run
FUSE no longer chokes if a commit is closed after the mount happened
GCE/AWS backends have been made a lot more reliable
Tests:
From 1.0.0 to 1.1.0 we’ve gone from 70 tests to 120, a 71% increase.
Download: https://github.com/pachyderm/pachyderm
Source: OSChina (开源中国社区)

