Paper notes and evaluation
Johannes Wünsche, 2022-09-08

Implementation related papers

Background Knowledge

HPC Bottlenecks 1

Paper Summary and Survey

We try to divide all papers into three categories:

- Heuristics (smart selections)
- Machine Learning (anything to do with training)
- Hinting (possibly combined with one of the former)


Block-Level Data Migration in Tiered Storage System 2

  • eviction and prefetching of data
  • older paper from 2010; the "not focused" research they talk about is now in focus anyway
  • upgrade from frequency based methods
  • metrics based approach with:
    • write/read frequency
    • write/read size
    • average write/read size
    • data placement
    • relevance between data blocks (defined by)
    • block valuation based on the aforementioned factors and 2 compensation coefficients
  • getting the formulas all translated could be quite a bit of work but no roadblock
  • no evaluation done in the paper
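
The block valuation above could be sketched roughly as follows. This is a minimal sketch, not the paper's formulas: the function name, the metric combination, and the `alpha`/`beta` stand-ins for the two compensation coefficients are all assumptions.

```python
# Hypothetical block-valuation sketch, NOT the paper's exact formulas.
# alpha and beta stand in for the two compensation coefficients.
def block_value(read_freq, write_freq, avg_read_size, avg_write_size,
                alpha=0.7, beta=0.3):
    """Higher score -> better candidate for a fast tier."""
    read_score = read_freq * avg_read_size
    write_score = write_freq * avg_write_size
    return alpha * read_score + beta * write_score

# A frequently read block should outrank a cold one.
hot = block_value(read_freq=100, write_freq=20,
                  avg_read_size=4096, avg_write_size=4096)
cold = block_value(read_freq=1, write_freq=0,
                   avg_read_size=4096, avg_write_size=0)
```

Translating the actual formulas from the paper would then mean replacing this linear combination term by term.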

AutoTiering: Automatic data placement manager in multi-tier all-flash datacenter 3

  • focus on migration without HDDs, valid point
  • the common assumption of a 10x speed difference between tiers breaks down here
  • observed case is the migration of virtual machine disks
  • tiers: NVMe SSD, with TLC SSD as backing storage
  • no duplicates should be held due to space cost restrictions
  • aim to avoid frequent migration of singular units
  • tested in simulations
  • metadata collected:
    • IO statistics
    • IO access patterns
    • latency tests results
  • other factors:
    • migration costs not negligible
    • future impact of optimal storage provision
  • due to the speed of SSDs, complete files (in this case virtual machine disks) are observed rather than individual blocks
  • the "Optimization Framework" seems well extractable and could be integrated into haura; maybe handle complete objects like this?
  • Around the "Optimization Framework" another approximation is done, since the exact solution would require solving an NP-hard (set) problem; their algorithm is at most O(n^2 k)

LRU and FIFO (Cache) 4

  • eviction policies
  • focus on very large scale systems
  • they find: FIFO sometimes has a better overall cost than LRU, even with a lower hit rate
  • main point: LRU is not always better than FIFO
  • cache is now much more than just main memory and can be persistent, NVM, SSD
  • overall a good reference for LRU and FIFO, but not directly useful for our purposes
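
For reference, a minimal harness to compare the two policies on a toy trace. The cache model and the trace are illustrative, not from the paper; the only structural difference between the policies is whether a hit refreshes recency.

```python
from collections import OrderedDict

def simulate(trace, capacity, policy):
    """Count cache hits for an LRU or FIFO cache of the given capacity."""
    cache = OrderedDict()          # keys kept in insertion/recency order
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            if policy == "lru":    # only LRU refreshes recency on a hit
                cache.move_to_end(key)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict front: oldest / least recent
            cache[key] = True
    return hits

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
lru_hits = simulate(trace, 3, "lru")
fifo_hits = simulate(trace, 3, "fifo")
```

On this particular trace LRU wins on hit count; the papers' point is that hit count alone does not decide overall cost.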

hatS: A Heterogeneity-Aware Tiered Storage for Hadoop 5

  • Full migration
  • Extensive paper which includes much more about hadoop, we skip forward to their data placement and retrieval policies
  • hadoop: write once, read many; we differ from this a bit since we do support write many, read many
  • they also create replicas for improved throughput, the rest matches our conditions exactly
  • file level granularity
  • files are duplicated in multiple tiers, also for data loss protection reasons
  • tier aware policies depending on large selection of tiers
  • each tier is assigned a weight and a weighted random function determines the tier used for data retrieval
    • weights come from IOPS and capacity
  • they combine the tier-aware policy with another network-aware policy they created in the paper
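
The weighted random tier selection could look roughly like this. Hedged sketch: the exact way the paper combines IOPS and capacity into a weight is not reproduced here; the product used below is an assumption.

```python
import random

def tier_weights(tiers):
    # Assumed weight formula (IOPS x capacity); the paper derives weights
    # from IOPS and capacity but may combine them differently.
    return {name: spec["iops"] * spec["capacity_gb"]
            for name, spec in tiers.items()}

def pick_tier(tiers, rng=random):
    """Weighted random choice of the tier to retrieve a replica from."""
    weights = tier_weights(tiers)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

tiers = {
    "ssd": {"iops": 100_000, "capacity_gb": 500},
    "hdd": {"iops": 200, "capacity_gb": 4000},
}
```

The randomness is what spreads retrieval load across tiers holding replicas, instead of always hitting the fastest one.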

RTHMS: a tool for data placement on hybrid memory system 6

  • memory mapping (DRAM and HBM)

Machine Learning

Data Prefetching for Large Tiered Storage Systems (IBM) 7

  • Focus: prefetching from tape or cloud storage to faster tiers
  • works without file access history/traces; some files in long storage may not have them
  • uses file metadata for access patterns; here thought of as arbitrary information one might save for this purpose
  • long time archival is more influenced by user behavior (less predictable) than application behavior (more predictable)
  • active and passive metadata provision; they mention e.g. tagging context ids to blocks
  • other work mentioned works on metadata affinity (more similar ones more likely to be fetched)
  • gathering access information via IO is too slow: read operations are often followed shortly afterwards by more read requests, within less time than a prefetch from cold storage requires
  • performed better than simple thresholds
  • they selected specific metadata fields related to their case study of radio telescope data; largely semantic

SLRL: A Simple Least Remaining Lifetime File Eviction policy for HPC multi-tier storage systems 8


  • Differentiation between prefetching and eviction policies
  • supposed to have at best 10% better hit ratio than LRU
  • they focus on eviction for this paper and all assumptions made in these notes are relevant to eviction policies
  • LRU not optimal for modern IO traces (especially named here object stores)
    • metadata management blamed for performance issues; hit-rate is still good but cost is increased (argued in IBM paper)
    • FIFO actually behaves better in this use case
    • for us this is maybe not that relevant as metadata is managed separately and does not need to follow this migration policy necessarily
  • they use the lifetimes from 9 and carry them in a queue to evict them in FIFO style
  • on low locality it performs worse than FIFO or LRU
  • on high locality it performs better than FIFO (by 40%) and LRU (by 10%)
  • lifetime (time between creation and last access) predicted on creation
  • eviction happens on low storage space
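
A minimal sketch of such a least-remaining-lifetime structure, assuming lifetimes are predicted at creation as described above. The min-heap bookkeeping here is my choice for brevity; the paper carries lifetimes in a FIFO-style queue.

```python
import heapq

class SLRLCache:
    """Evict the file whose predicted remaining lifetime is smallest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = {}      # name -> present (toy payload)
        self.expiry = []     # min-heap of (creation time + predicted lifetime, name)

    def insert(self, name, now, predicted_lifetime):
        # Eviction happens on low storage space.
        while len(self.files) >= self.capacity:
            self.evict()
        self.files[name] = True
        heapq.heappush(self.expiry, (now + predicted_lifetime, name))

    def evict(self):
        while self.expiry:
            _, name = heapq.heappop(self.expiry)
            if name in self.files:   # skip stale heap entries
                del self.files[name]
                return name
        return None
```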

Predicting file lifetimes for data placement in multi-tiered storage systems for HPC 10

  • eviction
  • CNN
  • based on the access path; each path segment is used as an indicator
  • two tier architecture, result is either archival or performance
  • though they leave open the possibility of introducing more tiers, such as 3D XPoint (Optane) storage, so maybe also NVM
  • they build upon an existing metadata engine which manages and fetches metadata from a separate database
  • the main assumption is that in scientific applications the last access to a file mostly follows shortly after its creation
  • much more data is written than is ever read again
  • complex neural network construction, paths are simply used as pre-learning input
  • the validation set was a randomly selected 30% of the same dataset used for training; the transferability of this model is not obvious
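
One way the path segments might be turned into model inputs. Illustrative only: the paper feeds segments into a CNN, while here they are just hashed into fixed-size categorical features; the depth and bucket count are arbitrary.

```python
import zlib

def path_features(path, depth=4, buckets=1024):
    """Hash each path segment into a bounded categorical feature."""
    segments = [s for s in path.strip("/").split("/") if s]
    segments = (segments + [""] * depth)[:depth]   # pad/truncate to fixed depth
    # crc32 keeps the mapping deterministic across runs, unlike hash()
    return [zlib.crc32(s.encode()) % buckets for s in segments]
```

Files sharing a path prefix then share leading features, which is exactly the signal the path-based predictors exploit.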

Predicting File Lifetimes with Machine Learning 9

  • eviction
  • Random Forest & CNN
  • very similar approach to 10, paths are used with neural networks
  • the group and user of the file are considered valuable here, though this is not the case for us

Block Popularity Prediction for Multimedia Storage Systems Using Spatial-Temporal-Sequential Neural Networks 11

  • tested on traces built as follows: operation, logical block address, operation size, time
  • hot data blocks change rapidly and are somewhat uncertain in pattern prediction
  • inferring complex behavior from data blocks alone is difficult (here we could work with hinting though, like 12)
  • correlations between blocks exist but are challenging to design a general neural network for

most of what we've seen so far in the paper research is the preparation of traces from a known application, to see whether that improves the workflow

  • they want to extend the limited neural network approach by considering a specific use case - multimedia
    • divide IO behavior (frequency) into time blocks; here 100,000 IO operations
    • handle each block like a time series problem; as they seem to observe strong spatial and temporal correlations
    • predict next time step
  • data recorded and formatted to include spatial and temporal information (like for us object or dataset accordance and time)

This can be abstracted by recording workloads, both scientific and user-based; though in some cases it should not outperform metadata-based approaches.

  • large blocks used, 64M in one of the datasets, but scenarios with 4M blocks are also tested
  • features they extract for learning:
    • read/write frequency, also in correlation to one another
    • read size avg, min, max
    • aligned reads (size = 2^n) and non-aligned reads, as well as their ratios
    • neighborhood frequency; previous 10 blocks and following 10 blocks are used as a vector with their read frequency
  • heavily focused later on read operations
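
The feature list above can be sketched as a small extraction function over the trace records described earlier (operation, logical block address, size, time). Field and key names are assumptions; the neighborhood-frequency vector is omitted for brevity.

```python
def extract_features(records):
    """records: iterable of (operation, lba, size, time) tuples for one block."""
    reads = [r for r in records if r[0] == "read"]
    writes = [r for r in records if r[0] == "write"]
    read_sizes = [r[2] for r in reads]
    aligned = [s for s in read_sizes if s & (s - 1) == 0]  # size == 2^n
    return {
        "read_freq": len(reads),
        "write_freq": len(writes),
        "rw_ratio": len(reads) / max(len(writes), 1),
        "read_avg": sum(read_sizes) / max(len(read_sizes), 1),
        "read_min": min(read_sizes, default=0),
        "read_max": max(read_sizes, default=0),
        "aligned_ratio": len(aligned) / max(len(read_sizes), 1),
    }
```

The read-heavy key set mirrors the paper's later focus on read operations.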

maybe checkpoints break this assumption, as they are mostly written and seldom read, if at all

  • extensive description of the network architecture
  • multiple variations of the network are tested which observe quite different behavior in their evaluation

Automating Distributed Tiered Storage Management in Cluster Computing 13

Archivist: A Machine Learning Assisted Data Placement Mechanism for Hybrid Storage Systems 14

  • file tier prediction algorithm based on multiple classifiers, though in the end they choose neural networks as the best candidate

  • they consider movement cost in their desired end-class

  • base approach:

    • record set of file IO
    • determine "optimal" storage location for a file
    • train the network to predict this location
    • use the network to determine storage locations
    • update if the accuracy becomes too low (<80%)
      • they don't elaborate on this in the text, but since they talk about keeping the IO profiles, I assume they classify the last $num operations at this point to find acceptable storage tiers
  • results are not too in-depth but the approach is interesting and can be discussed in the background

A reinforcement learning framework for online data migration in hierarchical storage systems 15

  • they argue that for heuristics, "claims are harder to justify that they improve the overall system behavior"
  • future-proof optimizations to avoid frequent reshuffling -> performance impacting
  • migration can happen pre-emptively (loading ahead of time) or on next access, though the latter does not prevent storage from filling up
  • they use the latter as mostly performance is in focus, could be interesting
  • no mention of which inputs are used; a generalist method, only noting that read/write events are obviously the carriers

Efficient Hierarchical Storage Management Empowered by Reinforcement Learning 16

Implementation Candidate

  • they use a setup similar to ours, consisting of SSD, HDD, and some remote object store
  • hotness scale as some float from 0 to 1
  • file granularity (data points: hotness, size, type, use-case dependent properties)
  • LRU, LFU and size-temperature replacement are too static
  • this is supposed to allow for more dynamic access patterns
  • online approach
  • looks doable, they define their inputs clearly and argue that they can be changed
  • states and actions are defined

An Intelligent Data Placement Strategy for Hierarchical Storage Systems 17

  • base case explanation and paper references to them

  • Q-Learning approach

  • they judge other approaches as inflexible, or too difficult for the user (hinting)

  • they reference a paper about advantages of reinforcement learning for continuous decision problems (our case, mom look!)

  • they reference a multi-objective approach based on k-means clustering which sounds interesting

  • case: deployment within a cluster; nodes share an underlying distributed file system, e.g. Lustre; nodes also contain local flash and DRAM storage

  • they consider also load balancing between the tiers

  • the reward is the time spent on IO, which is reasonable and which we can already measure

  • file-based approach, as anything else causes too much overhead

  • different workflow dependent characteristics which can be partially extracted from SLURM for example:

    1. number of child tasks of the current task
    2. number of remaining tasks of the current workflow
    3. number of input files of the current task
    4. number of remaining output files of the current task
    5. file read frequency
    6. remaining capacity of all layers (3 in their case)
  • not quite online learning: due to the integrated neural network, it has to be trained first on simulated workflows

  • online learning then decides control- and data-flow based on the workflow files; state information about the decision and the actions the data experienced can then be stored to adjust the neural network periodically
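
A toy tabular Q-learning placement agent along these lines. Negative IO time as reward matches the reward structure above; everything else, including the tabular form instead of their neural network and the state encoding, is a simplification I made up. The state would be built from the workflow features listed above (child tasks, remaining tasks, read frequency, ...).

```python
import random
from collections import defaultdict

class PlacementAgent:
    """Tabular Q-learning over (state, tier) pairs; reward = -IO time."""

    def __init__(self, tiers, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)              # (state, tier) -> value
        self.tiers = tiers
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state, rng=random):
        if rng.random() < self.epsilon:          # explore occasionally
            return rng.choice(self.tiers)
        return max(self.tiers, key=lambda t: self.q[(state, t)])

    def update(self, state, tier, io_time, next_state):
        reward = -io_time                        # less IO time -> higher reward
        best_next = max(self.q[(next_state, t)] for t in self.tiers)
        key = (state, tier)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])
```

After observing that a placement caused high IO time, the agent's value for that (state, tier) pair drops and another tier wins the argmax.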

Tiered Data Storage Model 18

  • tbh: this paper is a bit wild

  • they reduce the problem of data placement to a clustering problem, which they solve with a Kohonen neural network (self-organizing map)

  • the input in this case is the metadata of a file and the distribution of files is represented in a two-dimensional matrix, where one dimension is the tier and the other individual disks (think RAID)

  • so the base idea is: we learn from a file's metadata where to place it in this matrix

  • ongoing migration is then solved with a frequency-based approach, where a certain range of request frequencies is defined (I suppose for each tier? not specified more closely in the paper, and a lot of administration effort)


HFA: A Hint Frequency-based approach to enhance the I/O performance of multi-level cache storage systems 12

  • Largely framework design; maybe we get something else from it too
  • division into three kinds of hints: promotion, demotion, application

HintStor: A Framework to Study I/O Hints in Heterogeneous Storage 19

  • Concept of stream ID:

    • This largely adds multiple access-pattern hints; it seems to have a huge impact on performance, especially since the patterns are also bound to certain storage tiers:
      • HDD: sequential read, sequential write
      • SSD: random read
      • SCM(NVM): random write
      • user facing classes:
        • read-intensive
        • write-intensive
        • backup/archival
      • though this is freely expandable, and we might want to give additional hints when working with a stream-ID-like interface
  • Data classification

    • base approach to determine which files of unknown or equal class to migrate/replicate
    • smaller files are preferred
    • this can be well abstracted into the object store interface
    • some additional messages are required for this though we have all the data
  • four base operations are designed for the base block-wise operations

    1. redirect - data was moved and the data is now at an external position (disk difference) not relevant to us
    2. migrate - a heatmap of the most used chunks is created and the most frequently used items are moved in a top-k scheme every 2 hours (the time range can be modified; not very reactive)
    3. replicate - store multiple versions of a chunk, meant to decrease latency on very active drives
    4. prefetch - load into some kind of buffer, e.g. DRAM; allows for partial transfers from disk to cache -> lower latency, as requests can be answered faster and do not require locking of chunks
  • I/O access hints

    • static: metadata, file size, file type, migration interval, chunk size
    • smaller files can be promoted more easily, larger ones are harder due to their inertia
    • a shorter migration interval logically brings a lower average latency; they move 1000 files at once each interval
    • dynamic: file open, write/read file, stream ID, cloud prefetch, task priority (background and foreground tasks)
    • priority could be interesting though probably for another day, as this can bring a lot more implementation effort
  • they achieve some good results in their scenarios, in which they used btrfs and ext4 as the modified file systems

  • Main Conclusions:

    • improvement can be made with data placement by evaluating static system hints as defined above, though several levels must be used: metadata, file size, types, ...
    • stream IDs improve the caching efficiency; data which can be served well by lower tiers might not need to be moved upwards
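
The stream-ID-to-tier pattern binding listed above can be written down as a small lookup. The mapping mirrors the tier assignments from the notes; the class and tier identifiers themselves are assumptions, since HintStor leaves the classes freely expandable.

```python
# Assumed identifiers, mirroring the pattern/tier lists above.
STREAM_TIER = {
    "sequential_read": "hdd",
    "sequential_write": "hdd",
    "random_read": "ssd",
    "random_write": "scm",
    # user-facing classes layered on top
    "read_intensive": "ssd",
    "write_intensive": "scm",
    "backup_archival": "hdd",
}

def target_tier(stream_id, default="hdd"):
    """Map a stream ID to the tier that serves its pattern best."""
    return STREAM_TIER.get(stream_id, default)
```

Extending the scheme (e.g. for additional hints via a stream-ID-like interface) amounts to adding entries to the table.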

Towards a QoS-aware Virtualised Storage System 20


Multi-Tier Buffer Management and Storage System Design for Non-Volatile Memory 21

Optimizing Hierarchical Storage Management For Database System 22

Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems 23

  • bulk data movement and placement of datasets

    • similar to our datasets, as they are bound to scientific applications in HPC
  • division into control and data plane (similar to our daemon setup with control messages in a centralized service)

  • somewhat attached to job schedulers

  • they implement a complete resource manager and IO interface to assist the scientific workflow style, see also job schedulers

  • many automation steps are meant to be included, like automated archival for preservation

  • as well as an interface to work with multiple clusters

  • works with multiple replicas

  • "migration" is in this case simply done via a file system POSIX interface, or with a copy in geographically distributed storage

    • the policy might still be interesting: Data Orchestration
  • "Data Orchestration":

    • data placement: users can specify a preferred target when writing
    • data movement (migration): tries to honor the user choice, but has priority to decide when storage is tight
    • resource management: replication and promotion goals are determined via graph algorithms where the bandwidth between locations is considered
  • automation largely done for preconfigured requirements and distributed usage in clusters; ideas about burst buffers are nice but nothing new; the graph approach is interesting but not too usable for us at the current stage

Sorted out - non-applicable

Sorted out - unrelated
