Infrastructure Automation with Salt Sean McGrath 10th November 2016
About Research IT
Where I work as a systems administrator. http://www.tchpc.tcd.ie/
Ireland's premier High Performance Computing Centre, with large-scale supercomputing and visualisation facilities, assisting researchers with computationally complex problems.
Previously The Trinity Centre for High Performance Computing (TCHPC).
Manages in the region of 500 physical and virtual Linux machines.
About Configuration Management
"Configuration management (CM) is a systems engineering process for establishing and maintaining consistency of a product's performance, functional, and physical attributes with its requirements, design, and operational information throughout its life." 1
Dangers of Configuration Management:
Can't just do something on the fly; everything must be centrally managed.
There is a risk that local changes will get overwritten.
You have to learn another tool.
Examples: Salt, Puppet, Ansible, lots more.
1 https://en.wikipedia.org/wiki/Configuration_management
Why use a Configuration Manager?
It allows you to be:
Scalable: 400 HPC nodes with identical configuration.
Granular: specific configurations can be applied to specific nodes.
Automated: you write your salt statements once but call them repeatedly.
Repeatable: it's the same salt statement that is applied each time.
Efficient: once configured you don't have to intervene on specific machines.
About Salt
Salt (also called SaltStack): https://saltstack.com/
Client-server design. A service runs on the managed nodes (minions). Minions are controlled by the master. The master can push changes, or minions can pull from the master. The trust relationship is managed via asymmetric keys.
Some of what can be managed:
package installation
configuration files: sourced from the master, templates, find and replace in a file...
enabled and running services
specific settings or facts, e.g. a db password, for individual minions
much more
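As a taste of the items listed above, a minimal illustrative SLS state (the package, file, and state names here are hypothetical, not from the deck) ties a package, a config file sourced from the master, and a service together:

```yaml
# ntp.sls - a minimal sketch: package + config file + service
ntp:
  pkg.installed: []

/etc/ntp.conf:
  file.managed:
    - source: salt://ntp/ntp.conf   # file is sourced from the master
    - user: root
    - group: root
    - mode: 644

ntpd:
  service.running:
    - enable: True
    - watch:
      - file: /etc/ntp.conf         # restart the service if the config changes
```

The watch requisite is what makes this declarative: change the file on the master, and on the next run Salt rewrites it on the minion and restarts the service.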
Why Salt over Ansible or Puppet?
Firstly, this is nothing against either of those, both of which we have used. But in our experience Salt has advantages:
Usable error messages - Ansible's error messages lack detail.
Granularity - with Ansible it is hard to do a single thing to one node.
Scale - Ansible is slow in comparison to Salt.
YAML - Puppet requires some Ruby skills, which I at least don't have.
Flexibility - Puppet always seemed very controlling and designed for tight control.
Let's see some use case examples:
Use Case 1 - re-install a 100 node cluster
Step 1: Re-install the OS on the nodes with PXE.
Step 2: Install and configure salt on the nodes:
  salt-ssh --ignore-host-keys --passwd xxxxxxxx 'kelvin-n*' state.sls installsalt
Step 3: Reboot the nodes:
  salt-ssh --ignore-host-keys --passwd xxxxxxxx 'kelvin-n*' cmd.run reboot
Comment:
Salt scales very well in this instance, much better than Ansible.
It offers great efficiencies: once developed, your salt states automate all your work for you.
Can be easily repeated in future.
Let's look at what exactly the installsalt state does.
Use Case 1 - installsalt state
  install-epel:
    pkg:
      - installed
      - pkgs:
        - epel-release
        - yum-conf-epel

  salt-minion:
    pkg:
      - installed

  create-folders-for-keys:
    file:
      - name: /etc/salt/pki/minion/minion_master.pub
      - managed
      - makedirs: True

  {% for file in ['minion_master.pub', 'minion.pem', 'minion.pub'] %}
  /etc/salt/pki/minion/{{ file }}:
    file:
      - managed
      - source: salt://installsalt/{{ file }}.{{ clustername }}
      - watch_in:
        - service: restart-salt-minion
  {% endfor %}

  restart-salt-minion:
    service:
      - name: salt-minion
      - running

  /etc/rc.local:
    file:
      - append
      - text: sleep 10; salt-call state.highstate pillar="{\"reboot\": \"yes\"}" # wait 10
Use Case 2 - Bootstrap installation
Scenario: a new VM, and you want to put your common config on it.
Assumptions: OS and salt installed on the VM; minion's key signed by the master.
From the salt master:
  salt 'newserver.fqdn' state.sls bootstrap
Which does:
  include:
    # in case a highstate isn't called, ensure the states that should apply to every host do
    - general.init
    - tchpc-general.epel
    - tchpc-general.repositories.local
    - tchpc-general.nrpe
    - tchpc-general.shorewall
    - tchpc-general.postfix
    - tchpc-general.snmp
    - tchpc-general.services
    - tchpc-general.rsyslog
    - tchpc-general.check-updates
    - tchpc-general.bacula
Why: your standard setup is repeatedly and automatically applied.
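For context on the highstate mentioned above: in a highstate the master's top file decides which states each minion receives, so the bootstrap state is only needed when a highstate hasn't run yet. A hypothetical top.sls (the targets and state names below are illustrative, not the deck's actual top file) might look like:

```yaml
# /srv/salt/top.sls - maps minions to states for state.highstate
base:
  '*':                 # every minion gets the common states
    - general.init
  'kelvin-n*':         # compute nodes, matched by minion id glob
    - clusters.nodes
  'webserver.fqdn':
    - installwebserver
```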
Use Case 3 - Install a web server
Installation state:
  install-common-packages:
    pkg:
      - installed
      - pkgs:
        - httpd
        - php
        - php-devel
        - mysql
        - php-ldap
        - mod_ssl
        - openssl
        - php-mysql
        - php-pear-MDB2-Driver-mysql
        - mod_authz_ldap

  httpd-service:
    service:
      - name: httpd
      - running
      - enable: True
From the salt master:
  salt 'webserver.fqdn' state.sls installwebserver
That is a very efficient way to apply the usual web server settings to a host without having to reference standard install documentation.
Use Case 4 - software upgrade testing
Scenario: the kernel needs to be updated. Software (GPFS, a parallel file system) depends on the kernel version, so both need to be updated simultaneously. We want to test on a subset of nodes first.
Tell the minion what version to use - pillar: "Pillar is an interface for Salt designed to offer global values that can be distributed to minions." 2
Identify the minion to apply the pillar variable to - grains: an "interface to derive information about the underlying system. This is called the grains interface." 3
2 https://docs.saltstack.com/en/carbon/topics/pillar/index.html
3 https://docs.saltstack.com/en/latest/topics/grains/
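To make the distinction concrete: grains are facts read from the minion, pillar is data assigned by the master. A hypothetical pillar top file (file names and targets here are illustrative) can even target on a grain rather than a minion id:

```yaml
# /srv/pillar/top.sls - assign pillar files to minions
base:
  '*':
    - common              # shared settings, e.g. db passwords
  'os:CentOS':            # match on a grain instead of the minion id
    - match: grain
    - centos-settings
```

From the master you can inspect a minion's grains with, e.g., salt '*' grains.item os before deciding how to target.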
Use Case 4 - install specific versions on specific nodes
Set the pillar to the updated versions for your test node:
  # testing upgraded versions on specific nodes:
  {% if salt['grains.get']('id')[0:11] == 'kelvin-n038' %}
  gpfs_version: 3.5.0-32
  {% else %}
  gpfs_version: 3.5.0-29
  {% endif %}

  {% if salt['grains.get']('id')[0:11] == 'kelvin-n038' %}
  kernel_version: 2.6.32-642.3.1.el6.x86_64
  {% else %}
  kernel_version: 2.6.32-573.12.1.el6.x86_64
  {% endif %}
Install the relevant kernel version (pillar variable) for the node (identified by grain):
  {% if grains['kernelrelease'] != salt['pillar.get']('kernel_version') %}
  install-kernel-packages (cmd):
    cmd:
      - run
      - name: yum -y install kernel-headers-{{ salt['pillar.get']('kernel_version') }} ker...
This provides excellent granularity without having to provision a test environment.
Use Case 5 - GPU card installation
Process:
1. Remove the unsupported kernel modules and reboot to load the correct kernel modules.
2. Generate a new ramdisk without the unsupported kernel modules.
3. Boot from that new ramdisk.
4. Install the GPU drivers and reboot.
Limitation: we only want to run this state on a node with GPU hardware.
Gotcha: a possible infinite loop of reboots, unless the minion knows each step has completed successfully.
Solution: set a grain after each step.
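The grain-guard solution to the reboot-loop gotcha can be sketched generically before walking through the real states (the state IDs, grain name, and script path below are illustrative): run the step, record completion in a grain, reboot, and on the next highstate the Jinja guard skips the whole block.

```yaml
{% if grains.get('step_one') != 'done' %}
do-step-one:
  cmd.run:
    - name: /usr/local/sbin/step-one.sh   # hypothetical script for this step

record-step-one:
  module.run:
    - name: grains.setval       # persist completion on the minion
    - key: step_one
    - val: done
    - require:
      - cmd: do-step-one

reboot-after-step-one:
  module.run:
    - name: system.reboot
    - require:
      - module: record-step-one
{% endif %}   # next run: grain is 'done', block is skipped, no reboot loop
```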
Use Case 5 - Continued
Ensure this state only runs on a node with the GPU installed in it:
  {% if salt['grains.get']('gpus:model') == 'GK110BGL [Tesla K40m]' %}
Unload kernel modules:
  # the nouveau modules need to be removed from the kernel
  /etc/modprobe.d/blacklist-nouveau.conf:
    file.managed:
      - source: salt://clusters/nodes/gpu-boole/blacklist-nouveau.conf
      - mode: 644
      - user: root
      - group: root
Reboot; this requires salt being called with a reboot pillar set:
  {% if pillar['reboot'] != 'yes' %}
  always-fails-gpu:
    test.fail_without_changes:
      - name: MESSAGE the minion should reboot
      - failhard: True
  {% endif %}  # end if pillar['reboot'] != 'yes'
Use Case 5 - Continued
Generate the new ramdisk without those modules:
  {% if grains.get('regenerate_ramdisk') != 'regenerated' %}
  # ramdisk needs to be re-generated without the nouveau modules and the node booted from it
  create-ramdisk-without-the-nouveau-modules:
    cmd.run:
      - name: dracut --force
Set a grain value on the minion to say that the ramdisk has been re-generated:
  regenerate_ramdisk:
    module.run:
      - name: grains.setval
      - key: regenerate_ramdisk
      - val: regenerated
Boot from the new ramdisk:
  system.reboot-ramdisk:
    module:
      - name: system.reboot
      - run
      - require:
        - module: regenerate_ramdisk

  stops-after-ramdisk-reboot:
    test.fail_without_changes:
      - name: MESSAGE system rebooting
      - failhard: True
      - require:
        - module: system.reboot-ramdisk
  {% endif %}
Research IT, Trinity College Dublin, sean.mcgrath@tcd.ie
Use Case 5 - Continued
Install the GPU drivers only if they haven't already been installed:
  {% if grains.get('nvidia_drivers') != 'installed' %}
  Install-Nvidia-drivers:
    cmd.run:
      - name: /home/support/root/gpu/cuda_7.5.18_linux.run -silent
Set the grain to say they've been installed, to prevent a reboot loop, and reboot:
  nvidia_drivers:
    module.run:
      - name: grains.setval
      - key: nvidia_drivers
      - val: installed

  system.reboot-nvidia-drivers:
    module:
      - name: system.reboot
      - run
      - require:
        - module: nvidia_drivers

  stops-after-nvidia-drivers-reboot:
    test.fail_without_changes:  # this is really supported only from Salt 2014.7
      - name: MESSAGE system rebooting
      - failhard: True
      - require:
        - module: system.reboot-nvidia-drivers
  {% endif %}
Use Case 5 - benefits
Salt provides a simple and easy way to automatically provision a complex installation.
It is easily repeated if a minion has to be re-installed or new machines are added.
You have a centralised documentation store of exactly what needs to be done to set your installation up.
Salt takeaways
Salt does the work of configuring your machines for you.
Salt provides system documentation.
Salt provides a knowledge base of "how to do X".
Thank You! Slide source available at: https://github.com/smcgrat/presentations/blob/master/ Infrastructure_Automation_with_SaltStack.tex Questions?