lvm-localpv + FC + PureStorage #257

Open
metbog opened this issue Sep 5, 2023 · 9 comments
Assignees
Labels
Backlog enhancement New feature or request

Comments

@metbog

metbog commented Sep 5, 2023

Hi there,

We have a Kubernetes cluster (k8s) with BareMetal (BM) workers. These BM workers are connected via Fibre Channel (FC) to PureStorage FA. Our goal is to create a shared volume for our BM workers and use it with lvm-localpv.

PureStorage -> (BM1, BM2) -> /dev/mapper/sharevolume (attached to each BM worker via FC) -> PV -> VG1

Here is the StorageClass:

allowVolumeExpansion: false
allowedTopologies:
- matchLabelExpressions:
  - key: kubernetes.io/hostname
    values:
    - bm-worker1
    - bm-worker2
    - bm-worker3
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-lvm
parameters:
  storage: lvm
  volgroup: test-pure-volume
provisioner: local.csi.openebs.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

One idea is to make it possible to reattach the LVM volume to any of the BM workers. Currently, the driver creates a Persistent Volume bound to one worker (the one where it was originally created), which prevents pods from starting on the other workers.
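
To make the limitation concrete, here is a rough sketch of what a PV provisioned through this StorageClass might look like. The PV name, capacity, and volume handle are hypothetical, and the exact topology key the driver writes is an assumption (it may differ from `kubernetes.io/hostname`). The point is only that the `nodeAffinity` term lists a single worker, so the scheduler will not place a consuming pod anywhere else.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example                 # hypothetical name
spec:
  capacity:
    storage: 10Gi                   # hypothetical size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: pure-lvm        # StorageClass shown above
  csi:
    driver: local.csi.openebs.io
    volumeHandle: pvc-example       # hypothetical handle
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname   # assumed topology key
              operator: In
              values:
                - bm-worker1        # only the node where the LV was created
```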

Is it possible to achieve this? Perhaps there is already a solution available for this issue?

@abhilashshetty04 abhilashshetty04 self-assigned this Sep 7, 2023
@abhilashshetty04
Member

@metbog This seems like a shared VG feature requirement. We tried shared VG support previously but had to shelve the task due to technical roadblocks.

@dsharma-dc dsharma-dc added the enhancement New feature or request label Apr 17, 2024
@dsharma-dc
Contributor

If I understand the requirement correctly, the same PVC needs to be used by applications on two (or more) different nodes, where the underlying PV sits on a shared VG managed by LVM, assuming the lock managers required for a shared VG are up and running on all worker nodes.
I don't see how the current CSI provisioner could provide this kind of RWX capability.
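
For illustration, the request described above would correspond roughly to a claim like the one below. The claim name and size are hypothetical; `pure-lvm` is the StorageClass from the original post. This is only a sketch of the shape of the request, not something the provisioner is known to satisfy.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical name
spec:
  storageClassName: pure-lvm   # StorageClass from the original post
  accessModes:
    - ReadWriteMany            # RWX: read-write mounts from several nodes
  resources:
    requests:
      storage: 10Gi            # hypothetical size
```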

@orville-wright
Contributor

@metbog This seems like a shared VG feature requirement. We tried shared VG support previously but had to shelve the task due to technical roadblocks.

@abhilashshetty04 what was the technical roadblock that we identified?

@abhilashshetty04
Member

@orville-wright, a previous OpenEBS employee attempted this feature. AFAIK, there were some roadblocks due to kernel semaphore dependencies.

This is the PR for your reference.
#184

@m-czarnik-exa

If I understand the requirement correctly, the same PVC needs to be used by applications on two (or more) different nodes, where the underlying PV sits on a shared VG managed by LVM, assuming the lock managers required for a shared VG are up and running on all worker nodes. I don't see how the current CSI provisioner could provide this kind of RWX capability.

Is there another way to achieve this with OpenEBS? For instance, VMware uses VMFS to reattach volumes or disks between VMs. I would like to find a way to use shared storage between nodes, and the only solution I've found so far involves replication. Is there an alternative approach?

@orville-wright
Contributor

Hi @m-czarnik-exa - I run Product Mgmt for openEBS.
OK, let's dig into your use case, figure some things out, and see if we can help.

openEBS is primarily designed as a Hyper-converged vSAN system. This means that...

  1. We prefer the storage media to be installed locally & physically as disk media in your physical cluster nodes.
  2. We provide an NVMe-oF (TCP & RDMA) vSAN Fabric between all Hyper-Converged nodes in the cluster (we call our vSAN fabric the Nexus).
    • The Nexus is a block-mode Storage Area Network (SAN) and works like a SAN within the cluster.
    • It can leverage 2 protocols: NVMe-TCP and NVMe-RDMA (iWARP and RoCE).
    • RDMA is a new addition. It was prototyped a couple of months ago and is in dev/eng right now. It significantly offloads CPU and memory by reducing reliance on the network stack. All RDMA vSAN Fabric I/O goes via RDMA directly between nodes.
  3. We have a Block Allocator stack that owns the physical disk media on each node and presents block devices into our DiskPool.
  4. We carve out PVs (LUNs) from our DiskPool (based on PVCs), and each PV (LUN) is made addressable anywhere within the Nexus vSAN fabric address space, to any node running the NEXUS, via the NVMe-TCP or NVMe-RDMA protocols.
  5. A PV can only have 1 single reservation from 1 single node, and I/O can only be done from 1 node. That I/O doesn't have to be node-local: any node in the NEXUS can address any LUN that has been presented to the NEXUS. Basically an internal vSAN.
  6. From here, we build a filesystem on top of the PV (LUN): ext3/ext4, XFS, or Btrfs. A node can then mount that PV just like a LUN gets mounted.
  7. Since that PV (LUN) is a fabric-attached block-mode kernel NVMe LBA namespace (acting as a disk), your mount operation is treated as a local kernel block device operation. Your node claims the PV, mounts the file system and all is well.

Operations 5 ... 7 are node-exclusive operations. Only 1 node can safely claim a PV (LUN) device, because that block device is presented into the kernel of the node. There is no way to safely arbitrate multiple kernels claiming the same PV (LUN), i.e. no easy, simple way... without the complexity of a clustered kernel block-device subsystem. YES, these exist, but they're complex, slow, painful to work with, difficult to manage, etc. On top of this, you would also need a Clustered File System with a distributed lock manager that understands that multiple nodes have physically claimed 1 single LUN and are sharing I/O to the same LUN. This would also require arbitration and a very complex clustered I/O Data-Plane. YES, these exist... but again, they are complex, many are not free/open source, and they are horribly complex to deploy and manage.

There are ways to do things that are close to what you are asking for.

You mention VMFS.

  • [quote:] "VMware uses VMFS to reattach volumes or disks between VMs."

  • VMware VMFS is a vSAN-like layer, but is not a real vSAN.

  • VMware vSAN replaced VMFS and is a real Hyper-Converged SDS vSAN layer... very similar to our NEXUS.

VMware vSAN and our NEXUS Fabric are similar in that any node in the vSAN fabric can address any block-mode disk device on the fabric. But... only 1 node can safely claim and do I/O to that LUN. VMFS and VMware vSAN are not clustered block-device systems or Clustered File Systems that allow multiple nodes to share-mount and share-write to exactly the same single LUN at the same time.

openEBS

For openEBS, we have 5 Storage Engines that the user can choose to deploy. Each has different characteristics and different backend Block Allocator kernels. In all of the above... I am referring to openEBS Mayastor (see attached pics), not openEBS Local-PV LVM, because Mayastor is the only Storage Engine that currently contains the NEXUS vSAN.

  • We leverage our NEXUS vSAN to intelligently transport chunks of LBAs & block namespaces between LUNs for replication, snapshots, clones, and other general SAN block-mode operations.
  • This allows us to provide Write-Order Fidelity synchronous I/O replication between N-way replicated PVs (LUNs).
  • We are working on enabling the K8s RWX PV flag (in co-operation with RedHat). This will allow multiple K8s nodes to take a shared reservation on 1 single PV, but this is extremely dangerous and easily leads to data loss. We're only going to support this for KubeVirt Live Migration, when migrating a KubeVirt KVM VM from 1 node to another, since KubeVirt, KVM and QEMU will safely arbitrate shared LUN I/O.

openEBS Local-PV LVM utilizes the LVM2 kernel (i.e. PE, PV, VG, LV structures) but does not currently utilize the NEXUS vSAN fabric. All I/O is node-local.

openEBS Local PV LVM

Our LVM2 kernel is very mature, rock solid and high performance. It does inherit the native LVM2 concept of a Clustered VG (Volume Group), which allows multiple nodes to share access to 1 single VG. This is somewhat like VMFS or VMware vSAN. You can extend LVM2 to work in a Clustered LVM mode, but we have not prototyped or tested this.

So... after all of this... as a starting primer... what problem are you trying to solve when you say the words...

  • " I would like to find a way to use shared storage between nodes".


- openEBS Mayastor (Fabric overview)

[image: overview_fabric]

- openEBS Mayastor (internal kernel)

[image: SPDK _Structure_components_v4]

I'm looping in @tiagolobocastro @niladrih and @abhilashshetty04 for any further commentary.

@m-czarnik-exa

m-czarnik-exa commented May 31, 2024

@orville-wright

First of all, thank you for this elaborate explanation :)

  • So... after all of this... as a starting primer... what problem are you trying to solve?

Basically, what I'm trying to achieve is a migration from VMware CSI to open source virtualization platforms like Proxmox/OpenNebula etc., or possibly a bare metal solution.

The setup that I'm trying to configure (sorry for the simplifications, I don't have deep expertise in storage) is to connect, let's say, three k8s nodes to SAN storage over FC or iSCSI (to allocate block storage for these nodes) and be able to attach PVs created on one node to another node (VMware CSI just reattaches the VMDK from one VM to another when a pod with the PVC starts on another node). Given that the SAN storage already has RAID configured and that I would like to use Velero for backups, I don't need to replicate data from one node to another, since that would hurt performance. What I'm looking for is a CSI driver that handles shared storage from a disk array between nodes and is as fast as possible.

@dsharma-dc
Contributor

dsharma-dc commented Jun 3, 2024

@m-czarnik-exa Today it isn't possible for a PV (Persistent Volume) to be used by multiple nodes (or by a node other than the one where the PV was created). With the LVM-localPV engine, a PV represents the LVM logical volume created on that node.
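
As a rough sketch of that one-node ownership, the driver tracks each volume in a custom resource along the lines of the one below. The field names are written from memory and should be treated as assumptions to verify against the installed CRD; the resource name, namespace, and capacity are hypothetical. The relevant part is that a volume records a single owner node alongside its volume group.

```yaml
apiVersion: local.openebs.io/v1alpha1   # assumed API group/version
kind: LVMVolume
metadata:
  name: pvc-example            # hypothetical; normally matches the PV name
  namespace: openebs           # assumed install namespace
spec:
  capacity: "10737418240"      # hypothetical size in bytes (10Gi)
  ownerNodeID: bm-worker1      # the single node that holds the logical volume
  volGroup: test-pure-volume   # VG named in the StorageClass above
```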

There is a somewhat similar use case: using an LVM VG as shared so that multiple nodes can create PVs on the same VG. This isn't complete yet; it is discussed and designed here: #134

@avishnu
Member

avishnu commented Sep 19, 2024

This requirement needs LVM shared VG support; it will be tracked after #134.

6 participants