-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zenfs gc sekhar feature #4
base: master
Are you sure you want to change the base?
Changes from all commits
96954ab
6d6e5d7
9aa4ccd
90cbf8b
b5674a2
e7435f3
313ca13
5890cac
f3df7ed
c9a45ec
f6210df
7f67872
afb8924
15e752e
583567c
69c733b
73795f0
3685f5b
eac8b8b
e9bc3ff
aaa1695
efe7c4d
c9b2e52
f53fdbb
a8ea09f
e529a1e
a4eea36
010b413
f023556
8476c8d
53b40b6
83cae4a
459696c
6dc51ef
e22f9a3
530b7a0
d703522
2a10511
2f85c63
26eccac
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
BasedOnStyle: Google | ||
PointerAlignment: Left |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ | |
#include <utility> | ||
#include <vector> | ||
|
||
#include "fs_zenfs.h" | ||
#include "rocksdb/env.h" | ||
#include "util/coding.h" | ||
#include "zbd_zenfs.h" | ||
|
@@ -632,6 +633,174 @@ size_t ZonedRandomAccessFile::GetUniqueId(char* id, size_t max_size) const { | |
return zoneFile_->GetUniqueId(id, max_size); | ||
} | ||
|
||
// This is a helper function to read data from a source zone from a read | ||
// position -> read_pos. | ||
IOStatus ZenFSGCWorker::ReadExtent(Slice* buf, uint64_t read_pos, | ||
Zone* zone_src) { | ||
int f = zbd_->GetReadFD(); | ||
const char* data = buf->data(); | ||
size_t read = 0; | ||
size_t to_read = buf->size(); | ||
int ret; | ||
|
||
if (read_pos >= zone_src->wp_) { | ||
// EOF | ||
buf->clear(); | ||
return IOStatus::OK(); | ||
} | ||
|
||
if ((read_pos + to_read) > (zone_src->start_ + zone_src->max_capacity_)) { | ||
return IOStatus::IOError("Read across zone"); | ||
} | ||
|
||
while (read < to_read) { | ||
ret = pread(f, (void*)(data + read), to_read - read, read_pos); | ||
|
||
if (ret == -1 && errno == EINTR) continue; | ||
if (ret < 0) return IOStatus::IOError("Read failed"); | ||
|
||
read += ret; | ||
read_pos += ret; | ||
} | ||
|
||
return IOStatus::OK(); | ||
} | ||
|
||
// This is a heavy weight function. There is going to be a high | ||
// traffic activity via the PCIe channel to the ZNS SSD because | ||
// of the read/write to zones which needs to be issued. We need | ||
// some better ideas later to bring in efficiency, something like | ||
// "simple copy" or ideas in those lines. | ||
IOStatus ZenFSGCWorker::MoveValidDataToNewDestZone() { | ||
std::vector<Zone*>::iterator zone_it; | ||
std::vector<ZoneExtent*>::iterator ext_it; | ||
std::vector<ZoneExtent*>::iterator e_it; | ||
IOStatus s; | ||
uint64_t r_pos; | ||
uint32_t size; | ||
uint32_t long_ext_size; | ||
uint64_t new_start; | ||
void* align_buf; | ||
int dont_read = 0; | ||
|
||
// Sort the Extent list in decreasing order. | ||
std::sort(extent_list.begin(), extent_list.end(), | ||
[](ZoneExtent* ext1, ZoneExtent* ext2) { | ||
return ext1->length_ > ext2->length_; | ||
}); | ||
|
||
// Get the size of the largest extent. | ||
long_ext_size = extent_list[0]->length_; | ||
|
||
// Allocate a aligned buffer with size of the largest extent. | ||
// We have to issue pread from the source zones so we need a | ||
// buffer. | ||
int ret = posix_memalign(&align_buf, sysconf(_SC_PAGESIZE), long_ext_size); | ||
if (ret) return IOStatus::IOError("Failed to allocate aligned memory"); | ||
|
||
zone_it = dst_zone_list.begin(); | ||
for (ext_it = extent_list.begin(); ext_it != extent_list.end();) { | ||
ZoneExtent* ext; | ||
Zone* zone_dst; | ||
|
||
ext = *ext_it; | ||
zone_dst = *zone_it; | ||
|
||
// Set the position and length in the source zone to | ||
// read the data. | ||
r_pos = ext->start_; | ||
size = ext->length_; | ||
Slice buf((const char*)align_buf, size); | ||
|
||
if (!dont_read) { | ||
s = ReadExtent(&buf, r_pos, ext->zone_); | ||
if (!s.ok()) { | ||
free(align_buf); | ||
return s; | ||
} | ||
} | ||
|
||
// Store the new starting position for the extent | ||
// which will be later made persistent. | ||
new_start = zone_dst->wp_; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there could be multiple threads appending data to a zone, how could we ensure that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes that's a valid concern. We need to figure out a way to serialize that. I did not find a way to lock the zone and operate on it or maybe there is a way. Perhaps we need a bit more thought on this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. As far as I can see, in ZenFS, there is only one writable file in one zone. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hmm. OK, we have to add that logic in the function GetDestZoneToMoveValidData() and the function should push zones to the dst_zone_list member variable. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As in https://github.com/bzbd/zenfs/blob/master/fs/zbd_zenfs.cc#L518, allocate a zone without setting |
||
|
||
// Write the valid data where were read from the | ||
// source zone to the destination zone. | ||
s = zone_dst->Append((char*)align_buf, size); | ||
if (s.ok()) { | ||
// Data was written to the new zone, so the extent | ||
// will have a new starting position. No need to | ||
// change the length of the extent as it will be the | ||
// same. | ||
ext->start_ = new_start; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same lock issue here. When we are changing extent information, it is possible that some other operating is in the process of reading where the extent is. We need to apply some kind of lock throughout the GC process. |
||
|
||
// The extent was moved to a new zone so change the | ||
// resident zone parameter of the extent. | ||
ext->zone_ = zone_dst; | ||
|
||
// Current extent was written so now fetch the next extent. | ||
ext_it++; | ||
memset(align_buf, 0, long_ext_size); | ||
dont_read = 0; | ||
continue; | ||
} | ||
|
||
if (s == IOStatus::NoSpace()) { | ||
// Data was already read before, no need to read it again | ||
dont_read = 1; | ||
skyzh marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
// The current zone cannot fit this extent because of lack | ||
// of space, so get the next zone from the dst_zone_list. | ||
zone_it++; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we consider the case that there is not enough destination zone? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that possibility could arise. The function GetDestZoneToMoveValidData() should do that math and figure out how many zones are needed to fit all the extents(cold valid data) and called AllocateZone() for moving the valid cold data. If the desired amount of zones to fit all the cold valid data( or extents ) is not possible to allocate then it will send IOStatus error that "Not enough zones". In that case, the GC cannot progress. Maybe a slight design change can be done that we try to move only that much valid data, for which we have space(zones). Since this is a periodic background reclaim, in the next cycles, the remaining zones could be reclaimed. Just a thought, but this could make the implementation intricate. |
||
} | ||
|
||
// There was an error so cannot proceed and simply | ||
// we return the status. | ||
if (s == IOStatus::IOError()) { | ||
// If memory was allocated, we free it before returning. | ||
free(align_buf); | ||
|
||
return s; | ||
} | ||
} | ||
|
||
// Free the allocated buffer before returning OK status. | ||
free(align_buf); | ||
|
||
return IOStatus::OK(); | ||
skyzh marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
IOStatus ZenFSGCWorker::UpdateMetadataAfterMerge() { | ||
std::vector<ZoneFile*>::iterator zone_file_it; | ||
IOStatus s; | ||
for (zone_file_it = files_moved_to_dst_zone.begin(); | ||
zone_file_it != files_moved_to_dst_zone.end(); zone_file_it++) { | ||
ZoneFile* file_moved; | ||
file_moved = *zone_file_it; | ||
|
||
// What if the file is deleted before coming here? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe we should lock the metadata and scan if there is any new records before updating metadata. This may require more careful design work. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
SyncFileMetadata() function calls the PersistRecord() function which holds lock before updating the metadata so I did not use any lock. Maybe you are right. I was thinking maybe we can remove the UpdateMetadataAfterMerge() completely and somehow figure out to update the metadata changes in the MoveValidDataToNewDestZone() function itself. It would have been less code and easier because why to have one more function. But to do that we need to get the reference to the files whose extents have been moved. From the files, its easier to get the reference to the extents but not sure how to get the reference of the files from the extents. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks good. If possible, please update the design doc for new information. |
||
// We don't have to update the metadata if the file | ||
// is deleted, because once deleted the metadata is | ||
// already synced in the DeleteFile() function. | ||
fs->files_mtx_.lock(); | ||
if (fs->files_.find(file_moved->filename_) == fs->files_.end()) { | ||
// Should we erase this because this is | ||
// already deleted ? | ||
files_moved_to_dst_zone.erase(zone_file_it); | ||
fs->files_mtx_.unlock(); | ||
continue; | ||
} | ||
fs->files_mtx_.unlock(); | ||
|
||
// TODO: Need to give a thought about Changlong's comment | ||
// on how to trash/deal with old metadata after new changes. | ||
s = fs->SyncFileMetadata(file_moved); | ||
if (!s.ok()) return s; | ||
} | ||
|
||
return IOStatus::OK(); | ||
} | ||
|
||
} // namespace ROCKSDB_NAMESPACE | ||
|
||
#endif // !defined(ROCKSDB_LITE) && !defined(OS_WIN) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,6 +25,8 @@ | |
|
||
namespace ROCKSDB_NAMESPACE { | ||
|
||
class ZenFS; | ||
|
||
class ZoneExtent { | ||
public: | ||
uint64_t start_; | ||
|
@@ -37,6 +39,8 @@ class ZoneExtent { | |
}; | ||
|
||
class ZoneFile { | ||
friend class ZenFSGCWorker; | ||
|
||
protected: | ||
ZonedBlockDevice* zbd_; | ||
std::vector<ZoneExtent*> extents_; | ||
|
@@ -226,6 +230,40 @@ class ZonedRandomAccessFile : public FSRandomAccessFile { | |
size_t GetUniqueId(char* id, size_t max_size) const override; | ||
}; | ||
|
||
class ZenFSGCWorker { | ||
// if below variable 'fs' cannot reference to ZenFS, then we have to populate | ||
// it in the initiator!!! | ||
ZenFS* fs; // friend class, used to check the zonefile name | ||
|
||
ZonedBlockDevice* zbd_; | ||
|
||
std::map<Zone*, uint64_t> | ||
zone_residue; // record each zone's residual data size | ||
std::vector<ZoneFile*> files_moved_to_dst_zone; | ||
|
||
std::atomic<uint64_t> | ||
total_residue_; // Is atomic necessary since only one thread at one time? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems that this variable is not used in this patch? |
||
|
||
std::vector<Zone*> merge_zone_list; | ||
std::vector<ZoneExtent*> extent_list; // To store the Extent list of marked | ||
// zone, used to move the Extent data | ||
std::vector<Zone*> dst_zone_list; // It is possible for residual data to be | ||
// larger than Zone Capacity | ||
|
||
public: | ||
explicit ZenFSGCWorker(); // need to init | ||
|
||
void CheckZoneValidResidualData(); // work for below functions | ||
|
||
std::vector<Zone*> MarkZonesToMergeData(); | ||
std::vector<Zone*> GetDestZoneToMoveValidData(uint64_t ttl_residue); | ||
IOStatus MoveValidDataToNewDestZone(); // extent_list and dst_zone_list | ||
void ZoneResetToReclaim(); // merge_zone_list | ||
IOStatus UpdateMetadataAfterMerge(); // files_moved_to_dst_zone | ||
IOStatus ReadExtent(Slice* buf, uint64_t read_pos, | ||
Zone* zone_src); // Helper function | ||
}; | ||
|
||
} // namespace ROCKSDB_NAMESPACE | ||
|
||
#endif // !defined(ROCKSDB_LITE) && defined(OS_LINUX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is okay to change the extent order of a file? File content is store sequentially as its order of extent. Say that we have three extents, 1: 1KB, 2: 1KB, 3: 2KB. When we read from the 2048 byte, we should begin at extent 3. After we sort the extents, the file content is changed.I have mistaken this extent list for file extent list. So I have another question: How to construct the contents ofZenFSGCWorker
? Do we have any plan on this?