Meltdown and Spectre bugs: Difference between revisions

From Alliance Doc
Jump to navigation Jump to search
(Marked this version for translation)
 
(23 intermediate revisions by 8 users not shown)
Line 1: Line 1:
{{Draft}}
<languages />


Meltdown and Spectre are bugs related to speculative execution in a variety of CPU architectures developed during the past ten to fifteen years and which affect in particular processors from Intel and AMD, including those in use on Compute Canada clusters. A detailed discussion of the two bugs can be found on [https://arstechnica.com/gadgets/2018/01/meltdown-and-spectre-every-modern-processor-has-unfixable-security-flaws/ this page] and Compute Canada personnel are currently patching all of the systems vulnerable to attacks based on these bugs. What sort of performance degradation users will observe as a consequence of the patches is dependent on the software you are using and how it interacts with the operating system but in general the more filesystem activity and other input/output operations that a program performs during its execution, the more likely it is to suffer from a slowdown. Some benchmarks of the performance loss for AI and machine learning codes are publicly [https://medium.com/implodinggradients/meltdown-c24a9d5e254e available] and we recommend that users consider running some simple tests of their own to see if there is any substantial loss of performance with their own code(s).
<translate>
<!--T:1-->
Meltdown and Spectre are bugs related to speculative execution in a variety of CPU architectures developed during the past ten to fifteen years and which affect in particular processors from Intel and AMD, including those in use on Compute Canada clusters. A detailed discussion of the two bugs can be found on [https://arstechnica.com/gadgets/2018/01/meltdown-and-spectre-every-modern-processor-has-unfixable-security-flaws/ this page]. Compute Canada personnel have patched systems deemed sensitive to these vulnerabilities.  


== What are the impacts ? ==
== What are the impacts ? == <!--T:2-->
=== Availability impacts ===
=== Availability impacts ===
Updates to patch the vulnerabilities require updating the operating system and rebooting the nodes. For compute nodes, this is typically done in a rolling fashion, resulting in nodes being unavailable for a short period of time. This may impair the scheduling of large jobs, but typically goes unnoticed by users. Some nodes, such as login nodes and cloud hosts will however see a short interruption of service.
Updates to patch the vulnerabilities required updating the operating system and rebooting the nodes. For compute nodes this was typically done in a rolling fashion, was largely transparent to users, and is now complete.


=== Performance impacts ===
<!--T:11-->
Many groups around the world, including within Compute Canada, are running various benchmarks. Numbers cited can sometimes be scary (up to 30 or even 50% performance hit), while other are very minimal. Tasks which do a lot of input/output (reading and writing files) seem to be most impacted. Such tasks include databases, file transfer (rsync). Typical high performance tasks and GPU tasks are not affected nearly as much. Different processor generations are also affected at different levels. If you suspect your jobs may incur a large performance penalty, we recommend that you compare its performance before and after. You can also contact our [[Technical support]] to get help, but keep in mind that code modification may be required to lessen the performance impact on any given code.  
Updates were applied at [[Graham]] between 2018 January 5 and January 31. Most nodes were updated by January 13.


Below, you will find various links pointing to performance benchmarks. Keep in mind that those were not necessarily run on hardware and operating system similar to what Compute Canada clusters are running.
=== Performance impacts === <!--T:3-->
Many groups around the world have run benchmarks to evaluate the effects of the operating system patches on performance. Certain figures that have been cited are alarming (up to a 30% or even 50% performance hit), while others are very minimal.


== What is Compute Canada doing about it ? ==
<!--T:4-->
Teams managing the Compute Canada clusters are acting digilently to update their servers as needed and as patches are being released by various vendors. Many servers have already been patched, but some may require more updates as vendors release new patches.
Tasks which involve a lot of input/output (reading and writing files) seem to be most heavily affected. Examples include databases, or file transfers (e.g. rsync). Most high performance computing jobs should be minimally affected since the vast majority of the run time is spent computing rather than doing input and output. Different processor generations are also affected to different degrees, with the most notable performance degradation reported for older processors.


== What should I do about it ? ==
<!--T:5-->
Security-wise, please rest assured that Compute Canada team members are taking every action possible to ensure that systems we run are secure. If you are operating your own virtual machine in our cloud, you are however responsible for updating its operating system to include the latest security patches (see next subsection).  
In the References section below you will find links to some recent performance comparisons. Keep in mind that these were not necessarily run on hardware and operating systems similar to what Compute Canada clusters are running.


Performance-wise, if you believe that your application may be severely impacted by the security patches, please contact our [[Technical support]] team. We encourage you to bring forward comparative performance numbers of your application (job run time after and before for example). Keep in mind however that mitigating the performance impact of the security patches is likely to require some modification to the code you are running, and may not always be possible.
== What is Compute Canada doing about it ? == <!--T:6-->
All vulnerable equipment operated by Compute Canada has been patched. If and when vendors release new patches, these will also be applied.


== What should I do about it ? == <!--T:7-->
Security-wise, please rest assured that Compute Canada team members are taking every action possible to ensure that systems we run are secure. '''If you are operating your own virtual machine''' in our cloud, you are however responsible for updating its operating system to include the latest security patches (see next subsection).


=== I have a virtual machine running on the Compute Canada Cloud ===
<!--T:8-->
Specific instructions to updating virtual machines will be forthcoming shortly.
Performance-wise, if you believe that your application may be severely impacted by the security patches, please contact our [[Technical support]] team. We encourage you to bring forward comparative performance numbers of your application (job run times before and after the announcement, for example). Keep in mind however that mitigating the performance impact of the security patches is likely to require some modification to the code you are running, and may not always be possible.


== References ==
=== I have a virtual machine running on the Compute Canada Cloud === <!--T:9-->
# Other general information about Spectre and Meltdown is available on the [https://www.us-cert.gov/ncas/alerts/TA18-004A US-CERT web site].
Update your virtual machine's operating system to the latest version frequently to ensure it has the latest security patches to address these bugs. See [[Security_considerations_when_running_a_VM#Updating_your_VM|updating your VM]] for specific instructions on how to update Linux VMs.
#* Includes comprehensive links to vendor patch sites.
 
# [https://www.phoronix.com/scan.php?page=article&item=linux-415-x86pti&num=2 Initial Benchmarks Of The Performance Impact Resulting From Linux's x86 Security Changes]
== References == <!--T:10-->
# [https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti&num=1 Further Analyzing The Intel CPU "x86 PTI Issue" On More Systems]
* Other general information about Spectre and Meltdown is available on the [https://www.us-cert.gov/ncas/alerts/TA18-004A US-CERT web site], which includes comprehensive links to vendor patch sites.
# [https://medium.com/implodinggradients/meltdown-c24a9d5e254e The Meltdown bug and the KPTI patch: How does it impact ML performance?]
* [https://www.phoronix.com/scan.php?page=article&item=linux-415-x86pti&num=2 Initial Benchmarks Of The Performance Impact Resulting From Linux's x86 Security Changes]
* [https://www.phoronix.com/scan.php?page=article&item=linux-more-x86pti&num=1 Further Analyzing The Intel CPU "x86 PTI Issue" On More Systems]
* [https://medium.com/implodinggradients/meltdown-c24a9d5e254e The Meltdown bug and the KPTI patch: How does it impact ML performance?]
* [https://www.ellexus.com/wp-content/uploads/2018/01/180107-Meltdown-and-Spectre-white-paper.pdf Ellexus whitepaper explaining HPC performance issues]
* [https://security.web.cern.ch/security/advisories/spectre-meltdown/spectre-meltdown.shtml Advisory and tools from CERN Computer Security group]
* [https://access.redhat.com/articles/3311301 Controlling the Performance Impact of Microcode and Security Patches for CVE-2017-5754 CVE-2017-5715 and CVE-2017-5753 using Red Hat Enterprise Linux Tunables]
* [https://access.redhat.com/labs/speculativeexecution/ Red Hat's Spectre and Meltdown detector tool]
* [https://arxiv.org/pdf/1801.04329.pdf Effect of Meltdown and Spectre Patches on the Performance of HPC Applications]
</translate>

Latest revision as of 14:41, 12 March 2018

Other languages:

Meltdown and Spectre are bugs related to speculative execution in a variety of CPU architectures developed during the past ten to fifteen years and which affect in particular processors from Intel and AMD, including those in use on Compute Canada clusters. A detailed discussion of the two bugs can be found on this page. Compute Canada personnel have patched systems deemed sensitive to these vulnerabilities.

What are the impacts ?

Availability impacts

Updates to patch the vulnerabilities required updating the operating system and rebooting the nodes. For compute nodes this was typically done in a rolling fashion, was largely transparent to users, and is now complete.

Updates were applied at Graham between 2018 January 5 and January 31. Most nodes were updated by January 13.

Performance impacts

Many groups around the world have run benchmarks to evaluate the effects of the operating system patches on performance. Certain figures that have been cited are alarming (up to a 30% or even 50% performance hit), while others are very minimal.

Tasks which involve a lot of input/output (reading and writing files) seem to be most heavily affected. Examples include databases, or file transfers (e.g. rsync). Most high performance computing jobs should be minimally affected since the vast majority of the run time is spent computing rather than doing input and output. Different processor generations are also affected to different degrees, with the most notable performance degradation reported for older processors.

In the References section below you will find links to some recent performance comparisons. Keep in mind that these were not necessarily run on hardware and operating systems similar to what Compute Canada clusters are running.

What is Compute Canada doing about it ?

All vulnerable equipment operated by Compute Canada has been patched. If and when vendors release new patches, these will also be applied.

What should I do about it ?

Security-wise, please rest assured that Compute Canada team members are taking every action possible to ensure that systems we run are secure. If you are operating your own virtual machine in our cloud, you are however responsible for updating its operating system to include the latest security patches (see next subsection).

Performance-wise, if you believe that your application may be severely impacted by the security patches, please contact our Technical support team. We encourage you to bring forward comparative performance numbers of your application (job run times before and after the announcement, for example). Keep in mind however that mitigating the performance impact of the security patches is likely to require some modification to the code you are running, and may not always be possible.

I have a virtual machine running on the Compute Canada Cloud

Update your virtual machine's operating system to the latest version frequently to ensure it has the latest security patches to address these bugs. See updating your VM for specific instructions on how to update Linux VMs.

References