Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zenfs gc sekhar feature #4

Open
wants to merge 40 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
96954ab
Update io_zenfs.h
soumendus Jun 27, 2021
6d6e5d7
Update io_zenfs.h
soumendus Jun 27, 2021
9aa4ccd
Update io_zenfs.cc
soumendus Jun 27, 2021
90cbf8b
Update io_zenfs.cc
soumendus Jun 27, 2021
b5674a2
Update fs_zenfs.h
soumendus Jun 27, 2021
e7435f3
Update io_zenfs.h
soumendus Jun 27, 2021
313ca13
Indentation of file metadata function
soumendus Jun 27, 2021
5890cac
Removed a typo in UpdateMetadataAfterMerge()
soumendus Jun 27, 2021
f3df7ed
Fixed Misplaced unlock
soumendus Jun 27, 2021
c9a45ec
No need for ext_to_zone_map, removing it
soumendus Jun 29, 2021
f6210df
Changed return type from void to IOStatus
soumendus Jun 29, 2021
7f67872
Added MoveValidDataToNewDestZone() function
soumendus Jun 29, 2021
afb8924
Modified UpdateMetadataAfterMerge() function to support IOStatus retu…
soumendus Jun 29, 2021
15e752e
Changed void from IOStatus
soumendus Jun 29, 2021
583567c
Missing variable IOStatus
soumendus Jun 29, 2021
69c733b
TODO comment
soumendus Jun 30, 2021
73795f0
Added ReadExtent() function declaration
soumendus Jul 2, 2021
3685f5b
Added definition of ReadExtent()
soumendus Jul 2, 2021
eac8b8b
Modified MoveValidDataToNewDestZone()
soumendus Jul 2, 2021
e9bc3ff
Type cast to char*
soumendus Jul 2, 2021
aaa1695
TODO comment
soumendus Jul 2, 2021
efe7c4d
Removed comment
soumendus Jul 2, 2021
c9b2e52
Modified MoveValidDataToNewDestZone() and added helper ReadExtent()
soumendus Jul 6, 2021
f53fdbb
fixed Slice param
soumendus Jul 6, 2021
a8ea09f
Comment
royguo Jul 6, 2021
e529a1e
comment
royguo Jul 6, 2021
a4eea36
ci: specify pointer alignment for clang-format 12
skyzh Jul 7, 2021
010b413
Merge branch 'master' into zenfs_gc_sekhar_feature
skyzh Jul 7, 2021
f023556
format code and merge master
skyzh Jul 7, 2021
8476c8d
Fix for memory leak comment by Alex Chi
soumendus Jul 7, 2021
53b40b6
Fixed toggling of the flag dont_read
soumendus Jul 7, 2021
83cae4a
Removed ptr null check before delete
soumendus Jul 8, 2021
459696c
Fix: dont_read flag set to zero
soumendus Jul 8, 2021
6dc51ef
Use the IOStatus variable
soumendus Jul 8, 2021
e22f9a3
Moved status check inside dont_read branch
soumendus Jul 9, 2021
530b7a0
Aligned address for zone append
soumendus Jul 9, 2021
d703522
Removed code under #if 0
soumendus Jul 9, 2021
2a10511
Added comment
soumendus Jul 9, 2021
2f85c63
Alignment
soumendus Jul 9, 2021
26eccac
clang-format aligned
soumendus Jul 11, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .clang-format
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
BasedOnStyle: Google
PointerAlignment: Left
1 change: 1 addition & 0 deletions fs/fs_zenfs.h
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ class ZenMetaLog {
};

class ZenFS : public FileSystemWrapper {
friend class ZenFSGCWorker;
ZonedBlockDevice* zbd_;
std::map<std::string, ZoneFile*> files_;
std::mutex files_mtx_;
Expand Down
169 changes: 169 additions & 0 deletions fs/io_zenfs.cc
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
#include <utility>
#include <vector>

#include "fs_zenfs.h"
#include "rocksdb/env.h"
#include "util/coding.h"
#include "zbd_zenfs.h"
Expand Down Expand Up @@ -632,6 +633,174 @@ size_t ZonedRandomAccessFile::GetUniqueId(char* id, size_t max_size) const {
return zoneFile_->GetUniqueId(id, max_size);
}

// This is a helper function to read data from a source zone from a read
// position -> read_pos.
IOStatus ZenFSGCWorker::ReadExtent(Slice* buf, uint64_t read_pos,
Zone* zone_src) {
int f = zbd_->GetReadFD();
const char* data = buf->data();
size_t read = 0;
size_t to_read = buf->size();
int ret;

if (read_pos >= zone_src->wp_) {
// EOF
buf->clear();
return IOStatus::OK();
}

if ((read_pos + to_read) > (zone_src->start_ + zone_src->max_capacity_)) {
return IOStatus::IOError("Read across zone");
}

while (read < to_read) {
ret = pread(f, (void*)(data + read), to_read - read, read_pos);

if (ret == -1 && errno == EINTR) continue;
if (ret < 0) return IOStatus::IOError("Read failed");

read += ret;
read_pos += ret;
}

return IOStatus::OK();
}

// This is a heavy weight function. There is going to be a high
// traffic activity via the PCIe channel to the ZNS SSD because
// of the read/write to zones which needs to be issued. We need
// some better ideas later to bring in efficiency, something like
// "simple copy" or ideas in those lines.
IOStatus ZenFSGCWorker::MoveValidDataToNewDestZone() {
std::vector<Zone*>::iterator zone_it;
std::vector<ZoneExtent*>::iterator ext_it;
std::vector<ZoneExtent*>::iterator e_it;
IOStatus s;
uint64_t r_pos;
uint32_t size;
uint32_t long_ext_size;
uint64_t new_start;
void* align_buf;
int dont_read = 0;

// Sort the Extent list in decreasing order.
std::sort(extent_list.begin(), extent_list.end(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is okay to change the extent order of a file? File content is store sequentially as its order of extent. Say that we have three extents, 1: 1KB, 2: 1KB, 3: 2KB. When we read from the 2048 byte, we should begin at extent 3. After we sort the extents, the file content is changed. I have mistaken this extent list for file extent list. So I have another question: How to construct the contents of ZenFSGCWorker? Do we have any plan on this?

[](ZoneExtent* ext1, ZoneExtent* ext2) {
return ext1->length_ > ext2->length_;
});

// Get the size of the largest extent.
long_ext_size = extent_list[0]->length_;

// Allocate a aligned buffer with size of the largest extent.
// We have to issue pread from the source zones so we need a
// buffer.
int ret = posix_memalign(&align_buf, sysconf(_SC_PAGESIZE), long_ext_size);
if (ret) return IOStatus::IOError("Failed to allocate aligned memory");

zone_it = dst_zone_list.begin();
for (ext_it = extent_list.begin(); ext_it != extent_list.end();) {
ZoneExtent* ext;
Zone* zone_dst;

ext = *ext_it;
zone_dst = *zone_it;

// Set the position and length in the source zone to
// read the data.
r_pos = ext->start_;
size = ext->length_;
Slice buf((const char*)align_buf, size);

if (!dont_read) {
s = ReadExtent(&buf, r_pos, ext->zone_);
if (!s.ok()) {
free(align_buf);
return s;
}
}

// Store the new starting position for the extent
// which will be later made persistent.
new_start = zone_dst->wp_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there could be multiple threads appending data to a zone, how could we ensure that new_start is really at zone_dst->wp_?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there could be multiple threads appending data to a zone, how could we ensure that new_start is really at zone_dst->wp_?

Yes that's a valid concern. We need to figure out a way to serialize that. I did not find a way to lock the zone and operate on it or maybe there is a way. Perhaps we need a bit more thought on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As far as I can see, in ZenFS, there is only one writable file in one zone. AllocateZone will mark a zone as being written after allocating, making this zone not being able to used by other files for write. Maybe allocating free zone in dst_zone_list could solve this issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As far as I can see, in ZenFS, there is only one writable file in one zone. AllocateZone will mark a zone as being written after allocating, making this zone not being able to used by other files for write. Maybe allocating free zone in dst_zone_list could solve this issue.

Hmm. OK, we have to add that logic in the function GetDestZoneToMoveValidData() and the function should push zones to the dst_zone_list member variable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. As far as I can see, in ZenFS, there is only one writable file in one zone. AllocateZone will mark a zone as being written after allocating, making this zone not being able to used by other files for write. Maybe allocating free zone in dst_zone_list could solve this issue.

Hmm. OK, we have to add that logic in the function GetDestZoneToMoveValidData() and the function should push zones to the dst_zone_list member variable.

  • For getting the list of destination zones, in the function GetDestZoneToMoveValidData(), we may need to call AllocateZone() with open_for_write_ set to false or write another version of AllocateZone() ex. AllocateZoneForGC() which sets that flag to false. Then in the function MoveValidDataToNewDestinationZone(), after we fetch the current parameters of the destination zones like wp_ (write pointer), we can set the flag open_for_write_ back to true and then start appending to that zone.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in https://github.com/bzbd/zenfs/blob/master/fs/zbd_zenfs.cc#L518, allocate a zone without setting open_for_write to true may lead to some issues. The zone might be seen in "open but not allocated" state, and will be allocated to other files. Therefore, I don't think this would work.


// Write the valid data where were read from the
// source zone to the destination zone.
s = zone_dst->Append((char*)align_buf, size);
if (s.ok()) {
// Data was written to the new zone, so the extent
// will have a new starting position. No need to
// change the length of the extent as it will be the
// same.
ext->start_ = new_start;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same lock issue here. When we are changing extent information, it is possible that some other operating is in the process of reading where the extent is. We need to apply some kind of lock throughout the GC process.


// The extent was moved to a new zone so change the
// resident zone parameter of the extent.
ext->zone_ = zone_dst;

// Current extent was written so now fetch the next extent.
ext_it++;
memset(align_buf, 0, long_ext_size);
dont_read = 0;
continue;
}

if (s == IOStatus::NoSpace()) {
// Data was already read before, no need to read it again
dont_read = 1;

// The current zone cannot fit this extent because of lack
// of space, so get the next zone from the dst_zone_list.
zone_it++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider the case that there is not enough destination zone?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider the case that there is not enough destination zone?

Yes, that possibility could arise. The function GetDestZoneToMoveValidData() should do that math and figure out how many zones are needed to fit all the extents(cold valid data) and called AllocateZone() for moving the valid cold data. If the desired amount of zones to fit all the cold valid data( or extents ) is not possible to allocate then it will send IOStatus error that "Not enough zones". In that case, the GC cannot progress. Maybe a slight design change can be done that we try to move only that much valid data, for which we have space(zones). Since this is a periodic background reclaim, in the next cycles, the remaining zones could be reclaimed. Just a thought, but this could make the implementation intricate.

}

// There was an error so cannot proceed and simply
// we return the status.
if (s == IOStatus::IOError()) {
// If memory was allocated, we free it before returning.
free(align_buf);

return s;
}
}

// Free the allocated buffer before returning OK status.
free(align_buf);

return IOStatus::OK();
}

IOStatus ZenFSGCWorker::UpdateMetadataAfterMerge() {
std::vector<ZoneFile*>::iterator zone_file_it;
IOStatus s;
for (zone_file_it = files_moved_to_dst_zone.begin();
zone_file_it != files_moved_to_dst_zone.end(); zone_file_it++) {
ZoneFile* file_moved;
file_moved = *zone_file_it;

// What if the file is deleted before coming here?
Copy link
Collaborator

@skyzh skyzh Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should lock the metadata and scan if there is any new records before updating metadata. This may require more careful design work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOStatus s; is never assigned or modified throughout this function...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should lock the metadata and scan if there is any new records before updating metadata. This may require more careful design work.

SyncFileMetadata() function calls the PersistRecord() function which holds lock before updating the metadata so I did not use any lock. Maybe you are right. I was thinking maybe we can remove the UpdateMetadataAfterMerge() completely and somehow figure out to update the metadata changes in the MoveValidDataToNewDestZone() function itself. It would have been less code and easier because why to have one more function. But to do that we need to get the reference to the files whose extents have been moved. From the files, its easier to get the reference to the extents but not sure how to get the reference of the files from the extents.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. If possible, please update the design doc for new information.

// We don't have to update the metadata if the file
// is deleted, because once deleted the metadata is
// already synced in the DeleteFile() function.
fs->files_mtx_.lock();
if (fs->files_.find(file_moved->filename_) == fs->files_.end()) {
// Should we erase this because this is
// already deleted ?
files_moved_to_dst_zone.erase(zone_file_it);
fs->files_mtx_.unlock();
continue;
}
fs->files_mtx_.unlock();

// TODO: Need to give a thought about Changlong's comment
// on how to trash/deal with old metadata after new changes.
s = fs->SyncFileMetadata(file_moved);
if (!s.ok()) return s;
}

return IOStatus::OK();
}

} // namespace ROCKSDB_NAMESPACE

#endif // !defined(ROCKSDB_LITE) && !defined(OS_WIN)
38 changes: 38 additions & 0 deletions fs/io_zenfs.h
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@

namespace ROCKSDB_NAMESPACE {

class ZenFS;

class ZoneExtent {
public:
uint64_t start_;
Expand All @@ -37,6 +39,8 @@ class ZoneExtent {
};

class ZoneFile {
friend class ZenFSGCWorker;

protected:
ZonedBlockDevice* zbd_;
std::vector<ZoneExtent*> extents_;
Expand Down Expand Up @@ -226,6 +230,40 @@ class ZonedRandomAccessFile : public FSRandomAccessFile {
size_t GetUniqueId(char* id, size_t max_size) const override;
};

class ZenFSGCWorker {
// if below variable 'fs' cannot reference to ZenFS, then we have to populate
// it in the initiator!!!
ZenFS* fs; // friend class, used to check the zonefile name

ZonedBlockDevice* zbd_;

std::map<Zone*, uint64_t>
zone_residue; // record each zone's residual data size
std::vector<ZoneFile*> files_moved_to_dst_zone;

std::atomic<uint64_t>
total_residue_; // Is atomic necessary since only one thread at one time?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that this variable is not used in this patch?


std::vector<Zone*> merge_zone_list;
std::vector<ZoneExtent*> extent_list; // To store the Extent list of marked
// zone, used to move the Extent data
std::vector<Zone*> dst_zone_list; // It is possible for residual data to be
// larger than Zone Capacity

public:
explicit ZenFSGCWorker(); // need to init

void CheckZoneValidResidualData(); // work for below functions

std::vector<Zone*> MarkZonesToMergeData();
std::vector<Zone*> GetDestZoneToMoveValidData(uint64_t ttl_residue);
IOStatus MoveValidDataToNewDestZone(); // extent_list and dst_zone_list
void ZoneResetToReclaim(); // merge_zone_list
IOStatus UpdateMetadataAfterMerge(); // files_moved_to_dst_zone
IOStatus ReadExtent(Slice* buf, uint64_t read_pos,
Zone* zone_src); // Helper function
};

} // namespace ROCKSDB_NAMESPACE

#endif // !defined(ROCKSDB_LITE) && defined(OS_LINUX)