This guide presents a step-by-step approach to implementing SUSE Enterprise Storage 6 on the Huawei Taishan platform. It is suggested that the document be read in its entirety, along with the supplemental appendix information, before attempting the process.
The deployment presented in this guide aligns with architectural best practices and will support the implementation of all currently supported protocols as identified in the SUSE Enterprise Storage documentation.
Upon completion of the steps in this document, a working SUSE Enterprise Storage 6 cluster will be operational as described in the SUSE Enterprise Storage Deployment and Administration Guide.
This reference architecture is targeted at administrators who deploy software defined storage solutions within their data centers and make the different storage services accessible to their own customer base. By following this document, as well as those referenced herein, the administrator should have a full view of the SUSE Enterprise Storage architecture, deployment, and administrative tasks, with a specific set of recommendations for deployment of the hardware and networking platform.
The recommended architecture for SUSE Enterprise Storage on Huawei Taishan leverages two models of Huawei servers. The role and functionality of each type of system within the SUSE Enterprise Storage environment are explained in more detail in Chapter 5, Architectural Overview.
Huawei 10Gb switch. A higher-speed model is recommended if high-speed NICs are installed.
SUSE Enterprise Storage 6
SUSE Linux Enterprise Server 15 SP1
Customers of all sizes face a major storage challenge: while the overall cost per Terabyte of physical storage has gone down over the years, an explosion in data growth has taken place, driven by the need to access and leverage new data sources (e.g. external sources such as social media) and the ability to "manage" new data types (e.g. unstructured or object data). These ever-increasing "data lakes" need different access methods: File, Block, or Object.
Addressing these challenges with legacy storage solutions would require a number of specialized products (usually driven by access method) with traditional protection schemes (e.g. RAID). Such solutions struggle when scaling from Terabytes to Petabytes at reasonable cost and performance levels.
This software defined storage solution enables transformation of the enterprise infrastructure by providing a unified platform where structured and unstructured data can co-exist and be accessed as files, blocks, or objects depending on the application requirements. The combination of open-source software (Ceph) and industry standard servers reduce cost while providing the on-ramp to unlimited scalability needed to keep up with future demands.
SUSE Enterprise Storage delivers a highly scalable, resilient, self-healing storage system designed for large scale environments ranging from hundreds of Terabytes to Petabytes. This software defined storage product can reduce IT costs by leveraging industry standard servers to present unified storage servicing block, file, and object protocols. Having storage that can meet the current needs and requirements of the data center while supporting topologies and protocols demanded by new web-scale applications, enables administrators to support the ever-increasing storage requirements of the enterprise with ease.
Huawei Taishan servers provide a cost effective and scalable platform for the deployment of SUSE Enterprise Storage. These platforms unlock the full potential of the Kunpeng CPU, raising the bar of the SPECint benchmark by 25%, with up to 128 cores, 32 DDR4 DIMM slots, PCIe 4.0 support, and 100 GE LOM.
Featuring models tailored for computing, storage, or balanced needs, Taishan is perfect for demanding workloads such as big data analytics, database acceleration, high-performance computing, and cloud services. Taishan servers empower data centers with the ultimate efficiency.
This architecture overview section complements the SUSE Enterprise Storage Technical Overview document available online which presents the concepts behind software defined storage and Ceph as well as a quick start guide (non-platform specific).
SUSE Enterprise Storage provides unified block, file, and object access based on Ceph. Ceph is a distributed storage solution designed for scalability, reliability and performance. A critical component of Ceph is the RADOS object storage. RADOS enables a number of storage nodes to function together to store and retrieve data from the cluster using object storage techniques. The result is a storage solution that is abstracted from the hardware.
Ceph supports both native and traditional client access. The native clients are aware of the storage topology and communicate directly with the storage daemons over the public network, resulting in horizontally scaling performance. Non-native protocols, such as iSCSI, S3, and NFS require the use of gateways. While these gateways may be thought of as a limiting factor, the iSCSI and S3 gateways can scale horizontally using load balancing techniques.
In addition to the required network infrastructure, a minimum SUSE Enterprise Storage cluster consists of one administration server (physical or virtual), four object storage device (OSD) nodes, and three monitor (MON) nodes.
Please refer to the SES 6 Deployment Guide for more details on the hardware requirements.
One system is deployed as the administrative server. It is the Salt Master and hosts the SUSE Enterprise Storage administration interface, the dashboard, which is the central management system that supports the cluster.
Three systems are deployed as monitor (MON) nodes. Monitor nodes maintain information about the cluster health state, a map of the other monitor nodes, and a CRUSH map. They also keep a history of changes performed to the cluster.
It is strongly recommended to deploy monitors and other services on dedicated nodes. However, it is also possible to deploy the monitors on the OSD nodes if there are enough hardware resources, which is the case in this specific reference setup.
The RADOS gateway provides S3 and Swift based access methods to the cluster. These nodes are generally situated behind a load balancer infrastructure to provide redundancy and scalability. It is important to note that the load generated by the RADOS gateway can consume a significant amount of compute and memory resources; the minimum recommended configuration is therefore 6-8 CPU cores and 32 GB of RAM.
SUSE Enterprise Storage requires a minimum of four systems as storage nodes. The storage nodes contain individual storage devices that are each assigned an Object Storage Daemon (OSD). The OSD daemon assigned to the device stores data and manages the data replication and rebalancing processes. OSD daemons also communicate with the monitor (MON) nodes and provide them with the state of the other OSD daemons.
A software-defined solution is only as reliable as its slowest and least redundant component. This makes it important to design and implement a robust, high performance storage network infrastructure. From a network perspective for Ceph, this translates into:
Separation of cluster internal and client-facing public network traffic. This isolates Ceph OSD daemon replication activities from Ceph clients. This may be achieved through separate physical networks or through use of VLANs.
Redundancy and capacity in the form of bonded network interfaces connected to switches.
Figure 5.2, “Ceph Network Architecture” shows the logical layout of the Ceph cluster implementation.
Specific to this implementation, the following naming and addressing scheme was utilized.
Role | Hostname | Public Network | Cluster Network |
---|---|---|---|
Admin | admin.example.com | 10.1.1.3 | N/A |
Monitor | ceph1.example.com | 10.1.1.4 | N/A |
Monitor | ceph2.example.com | 10.1.1.5 | N/A |
Monitor | ceph3.example.com | 10.1.1.6 | N/A |
OSD Node | ceph1.example.com | 10.1.1.4 | 10.2.1.4 |
OSD Node | ceph2.example.com | 10.1.1.5 | 10.2.1.5 |
OSD Node | ceph3.example.com | 10.1.1.6 | 10.2.1.6 |
OSD Node | ceph4.example.com | 10.1.1.7 | 10.2.1.7 |
In this section, the focus is on the SUSE components: SUSE Linux Enterprise Server (SLES), SUSE Enterprise Storage (SES), and the Repository Mirroring Tool (RMT).
A world class secure, open source server operating system, equally adept at powering physical, virtual, or cloud-based mission-critical workloads. SUSE Linux Enterprise Server 15 SP1 further raises the bar in helping organizations to accelerate innovation, enhance system reliability, meet tough security requirements and adapt to new technologies.
Allows enterprise customers to optimize the management of SUSE Linux Enterprise (and other products such as SUSE Enterprise Storage) software updates and subscription entitlements. It establishes a proxy system for SUSE Customer Center with repository and registration targets.
Provided as a product on top of SUSE Linux Enterprise Server, this intelligent software-defined storage solution, powered by Ceph technology with enterprise engineering and support from SUSE, enables customers to transform their enterprise infrastructure to reduce costs while providing unlimited scalability.
This deployment section should be seen as a supplement to the official SUSE documentation. Please refer to Chapter 10, References and Resources for the list of related SUSE documents.
It is assumed that a Subscription Management Tool (SMT) server or a Repository Mirroring Tool (RMT) server exists within the environment. If not, please follow the RMT Guide and Section 8.11, “Offline setup” to make one available.
In this document we use example.com as the domain name for the nodes; replace it with your real domain name in your own installation.
The following considerations for the network configuration should be attended to:
Ensure that all network switches are updated with consistent firmware versions.
Depending on the network interface bonding mode used on the servers, corresponding switch port configuration may be required. Please consult your network administrator on this topic, as it is out of the scope of this document.
Network IP addressing and IP ranges need proper planning. In optimal environments, a dedicated storage subnet should be used for all SUSE Enterprise Storage nodes on the primary network, with a separate, dedicated subnet for the cluster network. Depending on the size of the installation, ranges larger than /24 may be required. When planning the network, current size as well as future growth should be taken into consideration.
The following considerations for the hardware platforms should be attended to:
Configure the system to run in performance mode if you prefer performance over power efficiency. To change this, reboot the system and press Del when prompted during system initialization to enter the BIOS setup menu. Then select › › and choose the appropriate option for best performance or for power efficiency.
In case you don’t see the SUSE installer screen after booting from the SUSE installation medium, check the BIOS option › › and set it to .
A RAID-1 volume consisting of two 600 GB SAS hard drives is sufficient for the OS disk.
If hard drives are connected to hardware RAID controller(s) with hardware write cache, configure each of them as an individual RAID-0 volume and make sure hardware caching is enabled.
Try to balance the drives across controllers, ports, and enclosures. Avoid making one part of the I/O subsystem busy while leaving other parts idle.
If SAS/SATA SSDs are installed, make sure to attach them to a dedicated HBA or RAID controller rather than to a controller that already has many HDDs attached.
The following considerations for the Operating System should be attended to:
The underlying OS for SES 6 is SUSE Linux Enterprise Server 15 SP1. Other OS versions are not supported. During installation, make sure the add-on modules below are selected.
Base system Module
Server Applications Module
SUSE Enterprise Storage 6
SUSE Enterprise Storage is a paid product in its own right. You need to purchase the subscription before you can install it as an add-on on top of SUSE Linux Enterprise Server.
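If the base OS is already installed, the modules and the SUSE Enterprise Storage extension can also be activated afterwards with SUSEConnect. The sketch below is illustrative only; the product identifiers shown for aarch64 are assumptions, so list the available extensions first and substitute your own registration code:

# list available modules/extensions and their exact activation identifiers
SUSEConnect --list-extensions
# activate the required modules and the SES 6 extension (identifiers are assumptions; confirm via the list above)
SUSEConnect -p sle-module-basesystem/15.1/aarch64
SUSEConnect -p sle-module-server-applications/15.1/aarch64
SUSEConnect -p ses/6/aarch64 -r <SES-registration-code>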
During installation, don’t select any GUI components such as the X Window System, GNOME, or KDE, as they are not needed to run the storage service.
It is highly recommended to register the systems to an update server to install the latest updates available, helping to ensure the best experience possible. The systems could be registered directly to SUSE Customer Center if it is a small cluster, or could be registered to a local SMT or RMT server when the cluster is large. Installing updates from a local SMT/RMT server will dramatically reduce the time required for updates to be downloaded to all nodes.
Refer to the Repository Mirroring Tool Guide for how to set up an RMT server.
Ensure that the operating system is installed on the correct device. Especially on OSD nodes, the installer may not choose the right one from many available drives.
Hostnames of all nodes should be properly configured. The full hostname (i.e. with the domain name) should always be assigned to each node, or else the deployment may fail. Make sure the hostname -s, hostname -f and hostname -i commands return proper results for the short hostname (without dots), the full hostname, and the IP address, respectively. Each node must also be able to resolve the hostnames of all nodes, including its own.
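As a quick check, the three commands can be run on each node; the values shown in the comments are what would be expected for the ceph1 node from the naming scheme above:

hostname -s    # short name, e.g. ceph1
hostname -f    # full hostname, e.g. ceph1.example.com
hostname -i    # IP address, e.g. 10.1.1.4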
For a rather small cluster, hosts files can be used for name resolution. Also see Section 8.2, “Copy files to all cluster nodes” for how to conveniently keep the hosts file on all nodes in sync.
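For illustration, a minimal /etc/hosts fragment matching the naming and addressing scheme in the table above could look like the following; adjust it to your own names and addresses:

10.1.1.3    admin.example.com    admin
10.1.1.4    ceph1.example.com    ceph1
10.1.1.5    ceph2.example.com    ceph2
10.1.1.6    ceph3.example.com    ceph3
10.1.1.7    ceph4.example.com    ceph4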
Having a DNS server is recommended for a larger cluster. See the SUSE Linux Enterprise Server Administration Guide for how to set up a DNS server.
Do ensure that NTP is configured to point to a valid, physical NTP server. This is critical for SUSE Enterprise Storage to function properly, and failure to do so can result in an unhealthy or non-functional cluster. Keep in mind that the NTP service is not designed to run in a virtualized environment, so make sure the NTP server being used is a physical machine, or strange clock drift problems may occur.
Salt, along with DeepSea, is a stack of components that help deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running.
There are three key Salt imperatives that need to be followed:
The Salt Master is the host that controls the entire cluster deployment. Ceph itself should NOT be running on the master as all resources should be dedicated to Salt master services. In our scenario, we used the Admin host as the Salt master.
Salt minions are nodes controlled by Salt master. OSD, monitor, and gateway nodes are all Salt minions in this installation.
Salt minions need to correctly resolve the Salt master’s host name.
DeepSea consists of a series of Salt files that automate the deployment and management of a Ceph cluster. It consolidates the administrator’s decision making in a single location around cluster layout, node role assignment, and drive assignment. DeepSea collects each set of tasks into a goal or stage.
The following steps, performed in order, were used for this reference implementation. All commands were run as the root user.
Install salt master on the Admin node:
zypper in salt-master
Start the salt-master service and enable start on boot:
systemctl enable --now salt-master.service
Install the salt-minion on all cluster nodes (including the Admin):
zypper in salt-minion
Configure all minions to connect to the Salt master:
Create a new file /etc/salt/minion.d/master.conf
with the following content:
master: admin.example.com
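If SSH access from the admin node to all nodes is already in place, the same file can be pushed to every node in one pass. This is only a convenience sketch; the host names are taken from the table earlier in this document:

for host in admin ceph1 ceph2 ceph3 ceph4; do
    ssh root@${host}.example.com \
        "mkdir -p /etc/salt/minion.d && echo 'master: admin.example.com' > /etc/salt/minion.d/master.conf"
done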
Restart the salt-minion service and enable it:
systemctl restart salt-minion.service
systemctl enable salt-minion.service
List Salt fingerprints on all the minions:
salt-call --local key.finger
List all incoming minion fingerprints on the Salt master and verify them against the fingerprints on each minion to make sure they all match. If they do, accept all Salt keys on the Salt master:
salt-key -F
salt-key --accept-all
salt-key --list-all
Verify that Salt works properly by pinging each minion from the Salt master. They should all return True on success:
salt '*' test.ping
Now check and make sure the time on all nodes is the same. In a later stage DeepSea will set up all nodes to synchronize time from the admin node, but before that is done, strange errors may occur if the time on the nodes is largely out of sync. It is therefore better to set all nodes to the same time manually first. For example, run the command below on your admin node:
salt '*' cmd.run 'date -s "2020-03-19 17:30:00"'
Use your actual time in the same format when running the command. It does not have to be very accurate, as all nodes will later be synchronized by the chrony time service.
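A quick way to confirm that the nodes now roughly agree on the time is to query them all from the Salt master and compare the output:

salt '*' cmd.run date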
If the OSD nodes were used in a prior installation, or the disks were previously used by other applications, zap ALL the OSD disks first.
This must be done on all the OSD disks that were used before, or else the deployment may fail when activating the OSDs.
The commands below should not be copied and executed blindly on your installation. The device names used below are just examples; you need to change them to match only the OSD disks in your own installation. Failing to use the correct device names may erase your OS disk or other disks that hold valuable data.
Wipe the beginning of each partition:
for partition in /dev/sdX[0-9]*
do
    dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
done
Wipe the beginning of the drive:
dd if=/dev/zero of=/dev/sdX bs=512 count=34 oflag=direct
Wipe the end of the drive:
dd if=/dev/zero of=/dev/sdX bs=512 count=33 \ seek=$((`blockdev --getsz /dev/sdX` - 33)) oflag=direct
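For convenience, the three wipe steps above can be combined into a single loop over all OSD data disks. The device names below are purely illustrative assumptions; replace them with the actual OSD disks in your installation and double-check the list before running it:

# WARNING: wiping the wrong device destroys data; verify the disk list first
for disk in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    # wipe the beginning of each existing partition, if any
    for partition in ${disk}[0-9]*; do
        [ -b "$partition" ] && dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
    done
    # wipe the beginning and the end of the drive
    dd if=/dev/zero of=$disk bs=512 count=34 oflag=direct
    dd if=/dev/zero of=$disk bs=512 count=33 \
        seek=$(( $(blockdev --getsz $disk) - 33 )) oflag=direct
done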
Install deepsea package on Admin node:
# zypper in deepsea
Check /srv/pillar/ceph/master_minion.sls for correctness.
Check the /srv/pillar/ceph/deepsea_minions.sls file and make sure the deepsea_minions option targets the correct nodes. In the usual case, it can simply be set as below to match all Salt minions in the cluster:
deepsea_minions: '*'
Create /srv/pillar/ceph/stack/ceph/cluster.yml with the options below:
cluster_network: <net/mask of the cluster network>
public_network: <net/mask of the public network>
time_server: <address of the NTP server; if this line is omitted, the admin node will be used>
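For reference, a filled-in cluster.yml matching the addressing scheme used in this document could look as follows, assuming /24 subnets for both networks; the time_server line is optional here because the admin node is the default:

cluster_network: 10.2.1.0/24
public_network: 10.1.1.0/24
time_server: admin.example.com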
At this point Deepsea commands can be run on the admin node to deploy the cluster.
Each command can be run either as:
salt-run state.orch ceph.stage.<stage name>
Or:
deepsea stage run ceph.stage.<stage name>
The latter form is preferred as it outputs real time progress.
During this stage, all required updates are applied and your nodes may be rebooted.
deepsea stage run ceph.stage.0
If the Salt master reboots during Stage 0, you need to run Stage 0 again after it boots up.
Optionally, create the /var/lib/ceph btrfs subvolume:
salt-run state.orch ceph.migrate.subvolume
During this stage, all hardware in your cluster is detected and the information necessary for the Ceph configuration is collected.
deepsea stage run ceph.stage.1
Configure the cluster and public networks in /srv/pillar/ceph/stack/ceph/cluster.yml if not yet done as described in Create cluster.yml.
Now a /srv/pillar/ceph/proposals/policy.cfg file needs to be created to instruct Deepsea on the location and configuration files to use for the different components that make up the Ceph cluster (Salt master, admin, monitor, OSD and other roles).
To do so, copy the example file to the right location then edit it to match your installation:
cp /usr/share/doc/packages/deepsea/examples/policy.cfg-rolebased /srv/pillar/ceph/proposals/policy.cfg
See Appendix A, policy.cfg example for the one used when installing the cluster described in this document.
During this stage, the necessary configuration data is prepared in a particular format.
deepsea stage run ceph.stage.2
Use the command below to check the attributes of each node:
salt '*' pillar.items
Ensure the public and cluster network attributes are the same as configured.
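To look at just the network attributes instead of the full pillar output, the values can also be queried directly:

salt '*' pillar.get public_network
salt '*' pillar.get cluster_network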
DriveGroups information is defined in the file /srv/salt/ceph/configuration/files/drive_groups.yml. It specifies which drives should be used as data devices, DB devices, or WAL devices, along with other parameters for setting up the OSDs.
First, take a look at all the disks on all OSD nodes:
salt-run disks.details
It lists the vendor, model, size, and type of the disks. This information can be used to match a group of drives and assign them to different uses.
Now define drive groups in the drive_groups.yml file.
See Appendix B, drive_groups.yml example for the drive group definition used in this example cluster. For complete information, refer to the Deployment Guide.
After you finish editing drive_groups.yml, run the commands below to see the resulting definition. Examine it carefully and make sure it meets your expectations before moving on to the next step.
salt-run disks.list
salt-run disks.report
A basic Ceph cluster with mandatory Ceph services is created in this stage.
deepsea stage run ceph.stage.3
It may take quite some time for the above command to finish if your cluster is large, you have a lot of disks, or your Internet bandwidth is limited and you did not register the nodes to a local SMT/RMT server.
After the above command is finished successfully, check whether the cluster is up by running:
ceph -s
Additional features of Ceph such as iSCSI, Object Gateway, and CephFS can be installed in this stage. Each is optional; install them according to your requirements.
deepsea stage run ceph.stage.4
After the above command finishes successfully, the SUSE Enterprise Storage cluster is considered fully deployed.
The steps below can be used to validate the overall cluster health by creating a test storage pool and running some write and read tests on it.
ceph status
ceph osd pool create test 1024
rados -p test bench 300 write --no-cleanup
rados -p test bench 300 seq
Once the tests are complete, you can remove the test pool via:
ceph tell mon.* injectargs --mon-allow-pool-delete=true
ceph osd pool delete test test --yes-i-really-really-mean-it
ceph tell mon.* injectargs --mon-allow-pool-delete=false
The default time server is the admin node. To change it, add
time_server: <server address>
in /srv/pillar/ceph/stack/ceph/cluster.yml
The salt-cp command can be used to copy files from the Salt master node to the minion nodes. This can be very convenient, for example, to keep the /etc/hosts file in sync on all nodes.
salt-cp '*' /etc/hosts /etc/hosts
Salt minion configuration file
Salt minion name. Useful if the host name was changed and the minion name needs to be changed accordingly.
Deepsea minion targets
DeepSea cluster configuration for the cluster "ceph" (the default cluster name). After modification, DeepSea Stage 2 needs to be run for the changes to take effect.
Affects all minions in the Salt cluster.
Affects all minions in the cluster named "ceph".
Affects all minions that are assigned the specific role in the ceph cluster.
Affects the individual minion.
In case you did something wrong and would like to start over without re-installing the whole OS.
# salt-run disengage.safety
# salt-run state.orch ceph.purge
# salt '*' saltutil.pillar_refresh
# salt '*' pillar.items
This will only give information after running Stage 1, the discovery stage.
See the Administration Guide
# salt-run net.iperf cluster=ceph output=full
# rados -p <pool name> bench 60 write
Use the following parameters for the bonding module in 802.3ad mode (requires switch support).
mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4
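As an illustration only, on SUSE Linux Enterprise Server a bonded public interface could be defined in /etc/sysconfig/network/ifcfg-bond0 along these lines. The interface names eth0/eth1 and the address (taken from ceph1 in the table earlier in this document) are assumptions; adjust them to your own hardware and addressing:

STARTMODE='auto'
BOOTPROTO='static'
IPADDR='10.1.1.4/24'
BONDING_MASTER='yes'
BONDING_MODULE_OPTS='mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4'
BONDING_SLAVE0='eth0'
BONDING_SLAVE1='eth1'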
See the Deployment Guide
Set up an SMT or RMT server, and mirror the repositories below from SCC.
SLE-Product-SLES15-SP1-Pool
SLE-Product-SLES15-SP1-Updates
SLE-Module-Server-Applications15-SP1-Pool
SLE-Module-Server-Applications15-SP1-Updates
SLE-Module-Basesystem15-SP1-Pool
SLE-Module-Basesystem15-SP1-Updates
SUSE-Enterprise-Storage-6-Pool
SUSE-Enterprise-Storage-6-Updates
Then point all nodes to the SMT/RMT server.
After changing node roles by editing policy.cfg, Stage 2 (Configure) needs to be run to refresh the configuration files.
# deepsea stage run ceph.stage.2
Check the SES 6 Administration Guide for more hints & tips, FAQ, and troubleshooting techniques.
The Huawei Taishan series represents a strong capacity-oriented platform. Combined with the access flexibility and reliability of SUSE Enterprise Storage and the industry-leading support from Huawei, any business can feel confident in its ability to address the exponential growth in storage it currently faces.
https://www.suse.com/media/white-paper/suse_enterprise_storage_technical_overview_wp.pdf
https://www.suse.com/products/suse-enterprise-storage/#tech-specs
https://www.suse.com/releasenotes/x86_64/SUSE-Enterprise-Storage/6/
https://documentation.suse.com/ses/6/single-html/ses-deployment/#book-storage-deployment
https://documentation.suse.com/ses/6/single-html/ses-admin/#book-storage-admin
https://documentation.suse.com/sles/15-SP1/single-html/SLES-deployment/#book-sle-deployment
https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#book-sle-admin
https://documentation.suse.com/sles/15-SP1/single-html/SLES-storage/#book-storage
https://documentation.suse.com/sles/15-SP1/single-html/SLES-rmt/#book-rmt
## Cluster Assignment
cluster-ceph/cluster/*.sls
## Roles
# ADMIN
role-master/cluster/admin*.sls
role-admin/cluster/admin*.sls
# Monitoring
role-prometheus/cluster/admin*.sls
role-grafana/cluster/admin*.sls
# MON
role-mon/cluster/ceph[123]*.sls
# MGR (mgrs are usually colocated with mons)
role-mgr/cluster/ceph[123]*.sls
# MDS
role-mds/cluster/ceph2*.sls
# IGW
role-igw/cluster/ceph3*.sls
# RGW
role-rgw/cluster/ceph4*.sls
# NFS
# role-ganesha/cluster/ganesha*.sls
# COMMON
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
# Storage
role-storage/cluster/ceph[1234]*.sls
default:
target: 'I@roles:storage'
data_devices:
# Use all hard disks as data device
rotational: 1
db_devices:
# Use solid state drives as db device
rotational: 0