2021-08-06  KVM: Cache the last used slot index per vCPU  (David Matlack, 2 files changed, -1/+34)
The memslot for a given gfn is looked up multiple times during page fault handling. Avoid binary searching for it multiple times by caching the most recently used slot. There is an existing VM-wide last_used_slot but that does not work well for cases where vCPUs are accessing memory in different slots (see performance data below). Another benefit of caching the most recently used slot (versus looking up the slot once and passing around a pointer) is speeding up memslot lookups *across* faults and during spte prefetching.

To measure the performance of this change I ran dirty_log_perf_test with 64 vCPUs and 64 memslots and measured "Populate memory time" and "Iteration 2 dirty memory time". Tests were run with eptad=N to force dirty logging to use fast_page_fault so its performance could be measured.

    Config     | Metric                        | Before | After
    ---------- | ----------------------------- | ------ | ------
    tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
    tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
    tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
    tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s

The "Iteration 2 dirty memory time" results are especially compelling because they are equivalent to running the same test with a single memslot. In other words, fast_page_fault performance no longer scales with the number of memslots.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
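As a rough illustration of the caching pattern described above (generic, simplified C; the struct and helper names here are hypothetical, not the actual KVM code), a lookup can try the cached index first and only fall back to the binary search on a miss, updating the cache when it does:

    #include <stddef.h>

    /* Simplified sketch of a gfn->slot lookup with a per-vCPU MRU cache. */
    struct memslot {
        unsigned long base_gfn;
        unsigned long npages;
    };

    struct slot_array {
        int used_slots;
        struct memslot slots[64];   /* sorted by base_gfn, descending */
    };

    static int gfn_in_slot(const struct memslot *s, unsigned long gfn)
    {
        return gfn >= s->base_gfn && gfn < s->base_gfn + s->npages;
    }

    /* Try the cached index first; only binary search on a cache miss. */
    static struct memslot *gfn_to_slot_cached(struct slot_array *a,
                                              unsigned long gfn, int *last_used)
    {
        int lo = 0, hi = a->used_slots - 1;

        if (*last_used < a->used_slots && gfn_in_slot(&a->slots[*last_used], gfn))
            return &a->slots[*last_used];

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;

            if (gfn_in_slot(&a->slots[mid], gfn)) {
                *last_used = mid;   /* remember the hit for the next lookup */
                return &a->slots[mid];
            }
            if (gfn < a->slots[mid].base_gfn)
                lo = mid + 1;       /* slots are sorted descending by base_gfn */
            else
                hi = mid - 1;
        }
        return NULL;                /* no slot contains this gfn */
    }

The win comes from faults and prefetches that repeatedly hit the same slot, which is the common case for a vCPU touching a contiguous region of guest memory.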
2021-08-06  KVM: Move last_used_slot logic out of search_memslots  (David Matlack, 3 files changed, -18/+50)
Make search_memslots unconditionally search all memslots and move the last_used_slot logic up one level to __gfn_to_memslot. This is in preparation for introducing a per-vCPU last_used_slot. As part of this change convert existing callers of search_memslots to __gfn_to_memslot to avoid making any functional changes. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06  KVM: Rename lru_slot to last_used_slot  (David Matlack, 3 files changed, -7/+7)
lru_slot is used to keep track of the index of the most-recently used memslot. The correct acronym would be "mru" but that is not a common acronym. So call it last_used_slot which is a bit more obvious. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05  KVM: xen: do not use struct gfn_to_hva_cache  (Paolo Bonzini, 4 files changed, -13/+19)
gfn_to_hva_cache is not thread-safe, so it is usually used only within a vCPU (whose code is protected by vcpu->mutex). The Xen interface implementation has such a cache in kvm->arch, but it is not really used except to store the location of the shared info page. Replace shinfo_set and shinfo_cache with just the value that is passed via KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the initialization value is not zero anymore and therefore kvm_xen_init_vm needs to be introduced. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04  KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces  (Like Xu, 4 files changed, -4/+8)
Based on our observations, after any vm-exit associated with vPMU, there are at least two or more perf interfaces to be called for guest counter emulation, such as perf_event_{pause, read_value, period}(), and each one will {lock, unlock} the same perf_event_ctx. The frequency of these calls becomes more severe when the guest uses counters in a multiplexed manner. Holding a lock once and completing the KVM request operations in the perf context would introduce a set of impractical new interfaces, so instead we can further optimize the vPMU implementation by avoiding repeated calls to these interfaces in the KVM context for at least one pattern:

After we call perf_event_pause() once, the event will be disabled and its internal count will be reset to 0. So there is no need to pause it again or read its value. Once the event is paused, the event period will not be updated until the next time it is resumed or reprogrammed. There is also no need to call perf_event_period() twice for a non-running counter, considering that the perf_event for a running counter is never paused.

Based on this implementation, for the following common usage of sampling 4 events using perf on a 4u8g guest:

    echo 0 > /proc/sys/kernel/watchdog
    echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
    echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
    echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
    for i in `seq 1 1 10`
    do
        taskset -c 0 perf record \
            -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
            /root/br_instr a
    done

the average latency of the guest NMI handler is reduced from 37646.7 ns to 32929.3 ns (~1.14x speedup) on an Intel ICX server. In addition to collecting more samples, no loss of sampling accuracy was observed compared to before the optimization.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210728120705.6855-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
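As a sketch of the pattern (simplified, with hypothetical struct and function names; kernel types such as u64/bool and the perf_event_pause()/perf_event_period() interfaces are assumptions here, not a copy of arch/x86/kvm/pmu.c), the per-counter flag lets the hot path skip the perf calls, and the perf_event_ctx lock they take, once the event is known to be paused:

    /* Illustrative sketch only, not the actual KVM vPMU code. */
    struct pmc_sketch {
        struct perf_event *perf_event;
        u64 counter;
        bool is_paused;
    };

    static void pmc_pause_counter(struct pmc_sketch *pmc)
    {
        if (!pmc->perf_event || pmc->is_paused)
            return;                 /* already paused: skip the ctx lock entirely */

        /* pausing disables the event and returns/resets its count */
        pmc->counter += perf_event_pause(pmc->perf_event, true);
        pmc->is_paused = true;
    }

    static int pmc_set_period(struct pmc_sketch *pmc, u64 period)
    {
        /* a running counter is never paused, so its period needs no update here */
        if (!pmc->is_paused)
            return 0;
        return perf_event_period(pmc->perf_event, period);
    }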
2021-08-04  KVM: X86: Optimize zapping rmap  (Peter Xu, 1 file changed, -12/+29)
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be slow. The easy way is to traverse the rmap list, collecting the a/d bits and freeing the slots along the way. Provide a pte_list_destroy() and do exactly that. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220605.26377-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
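A rough illustration of the single-pass approach (generic, simplified C; the structures and helper are hypothetical stand-ins for KVM's pte_list_desc code): walk the descriptor chain once, process each entry, and free each descriptor as soon as it is drained, instead of repeatedly removing the head element:

    #include <stdlib.h>

    #define DESC_ENTRIES 15

    /* Simplified stand-in for a pte_list_desc chain. */
    struct desc {
        unsigned long *sptes[DESC_ENTRIES];
        struct desc *more;          /* next array in the chain */
        int count;                  /* valid entries in this array */
    };

    /*
     * Destroy the whole list in one pass: visit every spte (e.g. to collect
     * accessed/dirty bits) and free each descriptor once it is consumed,
     * rather than calling get_first()/remove() once per entry.
     */
    static int desc_list_destroy(struct desc *head, void (*visit)(unsigned long *spte))
    {
        int freed = 0;

        while (head) {
            struct desc *next = head->more;
            int i;

            for (i = 0; i < head->count; i++)
                visit(head->sptes[i]);

            free(head);
            freed++;
            head = next;
        }
        return freed;
    }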
2021-08-04  KVM: X86: Optimize pte_list_desc with per-array counter  (Peter Xu, 1 file changed, -13/+22)
Add a counter field to pte_list_desc so as to simplify the add/remove/loop logic; e.g., we no longer need to loop over the array in most cases. This will make more sense after we've switched the array size to be larger; otherwise the counter would be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we don't need to traverse the list at least for pushing new ones (without the counter we traverse both the list and the array). However, that would need slightly more change without much benefit; e.g., after we grow the number of entries per array, traversing the list is not so expensive. So let's keep it simple but still get as much benefit as we can with just these extra few lines of changes (not to mention the code looks easier too without looping over arrays).

Using the same test case that forks 500 children and recycles them ("./rmap_fork 500" [1]), this patch further speeds up the total fork time by about 4%, for a total of 33% compared to the vanilla kernel:

    Vanilla:      473.90 (+-5.93%)
    3->15 slots:  366.10 (+-4.94%)
    Add counter:  351.00 (+-3.70%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
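To sketch why a per-array counter helps (again generic, simplified C rather than the actual pte_list_desc code; names are illustrative): with a count field, adding an spte writes directly into the next free slot and removal backfills from the tail, so neither operation scans the array:

    #define SLOT_ENTRIES 15

    struct spte_array {
        unsigned long *sptes[SLOT_ENTRIES];
        unsigned int count;         /* number of valid entries */
    };

    /* Insert without scanning for a free slot; returns 0 if the array is full. */
    static int spte_array_add(struct spte_array *a, unsigned long *spte)
    {
        if (a->count == SLOT_ENTRIES)
            return 0;
        a->sptes[a->count++] = spte;
        return 1;
    }

    /* Remove by moving the last entry into the hole; order is not preserved. */
    static void spte_array_remove(struct spte_array *a, unsigned int i)
    {
        a->sptes[i] = a->sptes[--a->count];
        a->sptes[a->count] = 0;     /* not strictly needed, keeps the tail clean */
    }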
2021-08-04  KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger  (Peter Xu, 1 file changed, -2/+2)
Currently an rmap array element contains only 3 entries. However, for EPT=N there can be a lot of guest pages that have tens or even hundreds of rmap entries. The rmap count statistics for a normal 6G guest (even if idle) show this distribution:

    Rmap_Count:  0        1      2-3    4-7   8-15  16-31  32-63  64-127  128-255  256-511  512-1023
    Level=4K:    3089171  49005  14016  1363  235   212    15     7       0        0        0
    Level=2M:    5951     227    0      0     0     0      0      0       0        0        0
    Level=1G:    32       0      0      0     0     0      0      0       0        0        0

With more forking, some pages will grow even larger rmap counts.

This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general EPT=N use case, as we dereference the list less often and the loops over PTE_LIST_EXT are slightly more efficient, while still not being so large that too much space is wasted when an array is not full. It should not affect EPT=Y, since EPT normally has only zero or one rmap entry per page, so no array is even allocated.

With a test case that forks 500 children and recycles them ("./rmap_fork 500" [1]), this patch speeds up fork time by about 29%:

    Before: 473.90 (+-5.93%)
    After:  366.10 (+-4.94%)

[1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-6-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03  KVM: const-ify all relevant uses of struct kvm_memory_slot  (Hamza Mahfooz, 6 files changed, -36/+34)
As alluded to in commit f36f3f2846b5 ("KVM: add "new" argument to kvm_arch_commit_memory_region"), a bunch of other places where struct kvm_memory_slot is used, needs to be refactored to preserve the "const"ness of struct kvm_memory_slot across-the-board. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Message-Id: <20210713023338.57108-1-someguy@effective-light.com> [Do not touch body of slot_rmap_walk_init. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03  KVM: Don't take mmu_lock for range invalidation unless necessary  (Paolo Bonzini, 1 file changed, -13/+12)
Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. This is possible now that memslot updates are blocked from range_start() to range_end(); that ensures that lock elision happens in both or none, and therefore that mmu_notifier_count updates (which must occur while holding mmu_lock for write) are always paired across start->end. Based on patches originally written by Ben Gardon. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03  KVM: Block memslot updates across range_start() and range_end()  (Paolo Bonzini, 3 files changed, -4/+65)
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. Because mmu_notifier_count must be modified while holding mmu_lock for write, and must always be paired across start->end to stay balanced, lock elision must happen in both or none. Therefore, in preparation for this change, this patch prevents memslot updates across range_start() and range_end().

Note, technically flag-only memslot updates could be allowed in parallel, but stalling a memslot update for a relatively short amount of time is not a scalability issue, and this is all more than complex enough.

A long note on the locking: a previous version of the patch used an rwsem to block the memslot update while the MMU notifiers run, but this resulted in the following deadlock involving the pseudo-lock tagged as "mmu_notifier_invalidate_range_start".

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0-rc3+ #6 Tainted: G OE
    ------------------------------------------------------
    qemu-system-x86/3069 is trying to acquire lock:
    ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

    but task is already holding lock:
    ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

    which lock already depends on the new lock.

This corresponds to the following MMU notifier logic:

    invalidate_range_start
      take pseudo lock
      down_read()           (*)
      release pseudo lock
    invalidate_range_end
      take pseudo lock      (**)
      up_read()
      release pseudo lock

At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock):

- invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes
- invalidate_range_end waits on the pseudo lock, held by invalidate_range_start.

Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers), so open-code the wait using a readers count and a spinlock. This also allows handling blockable and non-blockable critical sections in the same way.

Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
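A simplified sketch of that open-coded readers count + spinlock scheme follows. These are illustrative fragments in kernel style, not the exact kvm_main.c code; the field names, the local "wake" flag, and the rcuwait usage are assumptions based on the description above.

    /* .invalidate_range_start(): enter the "invalidation in progress" section */
    spin_lock(&kvm->mn_invalidate_lock);
    kvm->mn_active_invalidate_count++;
    spin_unlock(&kvm->mn_invalidate_lock);

    /* .invalidate_range_end(): leave the section, wake a pending memslot update */
    spin_lock(&kvm->mn_invalidate_lock);
    wake = !--kvm->mn_active_invalidate_count;
    spin_unlock(&kvm->mn_invalidate_lock);
    if (wake)
        rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);

    /* install_new_memslots(): wait (unfairly) for in-flight sections to drain */
    spin_lock(&kvm->mn_invalidate_lock);
    prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
    while (kvm->mn_active_invalidate_count) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        spin_unlock(&kvm->mn_invalidate_lock);
        schedule();
        spin_lock(&kvm->mn_invalidate_lock);
    }
    finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
    spin_unlock(&kvm->mn_invalidate_lock);

Because the waiter simply rechecks the count, readers never queue behind it, which is exactly the loss of fairness that breaks the lockdep cycle described above.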
2021-08-02  KVM: nSVM: remove useless kvm_clear_*_queue  (Paolo Bonzini, 1 file changed, -5/+0)
For an event to be in injected state when nested_svm_vmrun executes, it must have come from exitintinfo when svm_complete_interrupts ran:

    vcpu_enter_guest
      static_call(kvm_x86_run) -> svm_vcpu_run
        svm_complete_interrupts
          // now the event went from "exitintinfo" to "injected"
      static_call(kvm_x86_handle_exit) -> handle_exit
        svm_invoke_exit_handler
          vmrun_interception
            nested_svm_vmrun

However, no event could have been in exitintinfo before a VMRUN vmexit. The code in svm.c is a bit more permissive than the one in vmx.c:

    if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
        exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
        exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH &&
        exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI)

but in any case, a VMRUN instruction would not even start to execute during an attempted event delivery.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Preserve guest's CR0.CD/NW on INIT  (Sean Christopherson, 1 file changed, -1/+13)
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as defined by both Intel's SDM and AMD's APM.

Note, current versions of Intel's SDM are very poorly written with respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT" quite clearly lists power-up, RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM then attempts to qualify CD/NW behavior in a footnote:

    2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits are cleared.

Presumably that footnote is only meant for INIT, as the RESET case and especially the power-up case are rather non-sensical. Another footnote all but confirms that:

    6. Internal caches are invalid after power-up and RESET, but left unchanged with an INIT.

Bare metal testing shows that CD/NW are indeed preserved on INIT (someone else can hack their BIOS to check RESET and power-up :-D).

Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-47-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
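The resulting bit manipulation is small; a sketch of the idea (using the real X86_CR0_* flag names, but not necessarily the exact form of the vendor code):

    /* CR0.CD and CR0.NW: forced to '1' on RESET, preserved on INIT. */
    unsigned long old_cr0 = kvm_read_cr0(vcpu);
    unsigned long new_cr0 = X86_CR0_ET;

    if (init_event)
        new_cr0 |= old_cr0 & (X86_CR0_CD | X86_CR0_NW);
    else
        new_cr0 |= X86_CR0_CD | X86_CR0_NW;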
2021-08-02  KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET  (Sean Christopherson, 1 file changed, -3/+0)
Drop redundant clears of vcpu->arch.hflags in init_vmcb() since kvm_vcpu_reset() always clears hflags, and it is also always zero at vCPU creation time. And of course, the second clearing in init_vmcb() was always redundant. Suggested-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-46-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Emulate #INIT in response to triple fault shutdown  (Sean Christopherson, 2 files changed, -3/+8)
Emulate a full #INIT instead of simply initializing the VMCB if the guest hits a shutdown. Initializing the VMCB but not other vCPU state, much of which is mirrored by the VMCB, results in incoherent and broken vCPU state. Ideally, KVM would not automatically init anything on shutdown, and instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force userspace to explicitly INIT or RESET the vCPU. Even better would be to add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown (and SMI on Intel CPUs). But, that ship has sailed, and emulating #INIT is the next best thing as that has at least some connection with reality since there exist bare metal platforms that automatically INIT the CPU if it hits shutdown. Fixes: 46fe4ddd9dbb ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()  (Sean Christopherson, 1 file changed, -15/+13)
Move VMWRITE sequences in vmx_vcpu_reset() guarded by !init_event into init_vmcs() to make it more obvious that they're, uh, initializing the VMCS. No meaningful functional change intended (though the order of VMWRITEs and whatnot is different). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-44-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT  (Sean Christopherson, 1 file changed, -2/+0)
Drop a call to vmx_clear_hlt() during vCPU INIT, the guest's activity state is unconditionally set to "active" a few lines earlier in vmx_vcpu_reset(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function  (Sean Christopherson, 2 files changed, -35/+22)
Consolidate all of the dynamic MSR bitmap adjustments into vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect that it is x2APIC specific. If KVM gains more cases of dynamic MSR pass-through, odds are very good that those new cases will be better off with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL. Attempting to handle all updates in a common helper did more harm than good, as KVM ended up collecting a large number of useless "updates". Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode  (Sean Christopherson, 1 file changed, -1/+0)
Don't bother initializing msr_bitmap_mode to 0, all of struct vcpu_vmx is zero initialized. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is changed  (Sean Christopherson, 1 file changed, -1/+0)
Drop an explicit call to update the x2APIC MSRs when the userspace MSR filter is modified. The x2APIC MSRs are deliberately exempt from userspace filtering. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-40-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions  (Sean Christopherson, 3 files changed, -8/+1)
Drop unnecessary MSR bitmap updates during nested transitions, as L1's APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows, and L2's MSR bitmap is managed separately. In the unlikely event that L1 is pathological and loads APIC_BASE via the VM-Exit load list, KVM will handle updating the bitmap in its normal WRMSR flows. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-39-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT  (Sean Christopherson, 1 file changed, -3/+0)
Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT. In both cases, the MSR bitmap already has the desired values and state. At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode is guaranteed to be zero, i.e. reflecting x2APIC disabled. At INIT, the APIC_BASE MSR is not modified, thus there can't be any change in x2APIC state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-38-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86  (Sean Christopherson, 3 files changed, -15/+8)
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to common x86. VMX and SVM now have near-identical sequences, the only difference being that VMX updates the exception bitmap. Updating the bitmap on SVM is unnecessary, but benign. Unfortunately it can't be left behind in VMX due to the need to update exception intercepts after the control registers are set. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT  (Sean Christopherson, 1 file changed, -2/+2)
When emulating vCPU INIT, do not unconditionally refresh the list of user return MSRs that need to be loaded into hardware when running the guest. Unconditionally refreshing the list is confusing, as the vast majority of MSRs are not modified on INIT. The real motivation is to handle the case where an INIT during long mode obviates the need to load the SYSCALL MSRs, and that is handled as needed by vmx_set_efer(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-36-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Refresh list of user return MSRs after setting guest CPUID  (Sean Christopherson, 1 file changed, -0/+2)
After a CPUID update, refresh the list of user return MSRs that are loaded into hardware when running the vCPU. This is necessary to handle the oddball case where userspace exposes X86_FEATURE_RDTSCP to the guest after the vCPU is running. Fixes: 0023ef39dc35 ("kvm: vmx: Set IA32_TSC_AUX for legacy mode guests") Fixes: 4e47c7a6d714 ("KVM: VMX: Add instruction rdtscp support for guest") Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Skip pointless MSR bitmap update when setting EFER  (Sean Christopherson, 1 file changed, -9/+10)
Split setup_msrs() into vmx_setup_uret_msrs() and an open coded refresh of the MSR bitmap, and skip the latter when refreshing the user return MSRs during an EFER load. Only the x2APIC MSRs are dynamically exposed and hidden, and those are not affected by a change in EFER. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Stuff save->dr6 during VMSA sync, not at RESET/INIT  (Sean Christopherson, 2 files changed, -1/+1)
Move code to stuff vmcb->save.dr6 to its architectural init value from svm_vcpu_reset() into sev_es_sync_vmsa(). Except for protected guests, a.k.a. SEV-ES guests, vmcb->save.dr6 is set during VM-Enter, i.e. the extra write is unnecessary. For SEV-ES, stuffing save->dr6 handles a theoretical case where the VMSA could be encrypted before the first KVM_RUN. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INIT  (Sean Christopherson, 1 file changed, -3/+0)
Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the values being written are fully redundant with respect to svm_set_cr4(vcpu, 0) a few lines earlier. Note, svm_set_cr4() also correctly forces X86_CR4_PAE when NPT is disabled. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INIT  (Sean Christopherson, 1 file changed, -6/+1)
Hoist svm_set_cr0() up in the sequence of register initialization during vCPU RESET/INIT, purely to match VMX so that a future patch can move the sequences to common x86. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-31-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit  (Sean Christopherson, 3 files changed, -13/+11)
Use the "internal" variants of setting segment registers when stuffing state on nested VM-Exit in order to skip the "emulation required" updates. VM-Exit must always go to protected mode, and all segments are mostly hardcoded (to valid values) on VM-Exit. The bits of the segments that aren't hardcoded are explicitly checked during VM-Enter, e.g. the selector RPLs must all be zero. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Skip emulation required checks during pmode/rmode transitions  (Sean Christopherson, 1 file changed, -6/+12)
Don't refresh "emulation required" when stuffing segments during transitions to/from real mode when running without unrestricted guest. The checks are unnecessary as vmx_set_cr0() unconditionally rechecks "emulation required". They also happen to be broken, as enter_pmode() and enter_rmode() run with a stale vcpu->arch.cr0. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Process CR0.PG side effects after setting CR0 assets  (Sean Christopherson, 1 file changed, -11/+12)
Move the long mode and EPT w/o unrestricted guest side effect processing down in vmx_set_cr0() so that the EPT && !URG case doesn't have to stuff vcpu->arch.cr0 early. This also fixes an oddity where CR0 might not be marked available, i.e. the early vcpu->arch.cr0 write would appear to be in danger of being overwritten, though that can't actually happen in the current code since CR0.TS is the only guest-owned bit, and CR0.TS is not read by vmx_set_cr4(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0  (Sean Christopherson, 1 file changed, -3/+3)
Skip the MMU permission_fault() check if paging is disabled when verifying the cached MMIO GVA is usable. The check is unnecessary and can theoretically get a false positive since the MMU doesn't zero out "permissions" or "pkru_mask" when guest paging is disabled. The obvious alternative is to zero out all the bitmasks when configuring nonpaging MMUs, but that's unnecessary work and doesn't align with the MMU's general approach of doing as little as possible for flows that are supposed to be unreachable. This is nearly a nop as the false positive is nothing more than an insignificant performance blip, and more or less limited to string MMIO when L1 is running with paging disabled. KVM doesn't cache MMIO if L2 is active with nested TDP since the "GVA" is really an L2 GPA. If L2 is active without nested TDP, then paging can't be disabled as neither VMX nor SVM allows entering the guest without paging of some form. Jumping back to L1 with paging disabled, in that case direct_map is true and so KVM will use CR2 as a GPA; the only time it doesn't is if the fault from the emulator doesn't match or emulator_can_use_gpa(), and that fails only on string MMIO and other instructions with multiple memory operands. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is disabled  (Sean Christopherson, 1 file changed, -2/+5)
Tweak the logic for grabbing vmcs.GUEST_CR3 in vmx_cache_reg() to look directly at the execution controls, as opposed to effectively inferring the controls based on vCPUs. Inferring the controls isn't wrong, but it creates a very subtle dependency between the caching logic, the state of vcpu->arch.cr0 (via is_paging()), and the behavior of vmx_set_cr0(). Using the execution controls doesn't completely eliminate the dependency in vmx_set_cr0(), e.g. neglecting to cache CR3 before enabling interception would still break the guest, but it does reduce the code dependency and mostly eliminate the logical dependency (that CR3 loads are intercepted in certain scenarios). Eliminating the subtle read of vcpu->arch.cr0 will also allow for additional cleanup in vmx_set_cr0(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
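Concretely, the caching decision can key directly off the execution control, roughly like this (a sketch based on the description above, not guaranteed to match the final code verbatim):

    /*
     * Read GUEST_CR3 into the software cache only when CR3 loads are not
     * intercepted, i.e. when the guest can change CR3 without a VM-Exit.
     */
    if (!(exec_controls_get(to_vmx(vcpu)) & CPU_BASED_CR3_LOAD_EXITING))
        vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);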
2021-08-02  KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'em  (Sean Christopherson, 1 file changed, -9/+37)
Keep CR3 load/store exiting enabled as needed when running L2 in order to honor L1's desires. This fixes a largely theoretical bug where L1 could intercept CR3 but not CR0.PG and end up not getting the desired CR3 exits when L2 enables paging. In other words, the existing !is_paging() check inadvertently handles the normal case for L2 where vmx_set_cr0() is called during VM-Enter, which is guaranteed to run with paging enabled, and thus will never clear the bits. Removing the !is_paging() check will also allow future consolidation and cleanup of the related code. From a performance perspective, this is all a nop, as the VMCS controls shadow will optimize away the VMWRITE when the controls are in the desired state. Add a comment explaining why CR3 is intercepted, with a big disclaimer about not querying the old CR3. Because vmx_set_cr0() is used for flows that are not directly tied to MOV CR3, e.g. vCPU RESET/INIT and nested VM-Enter, it's possible that is_paging() is not synchronized with CR3 load/store exiting. This is actually guaranteed in the current code, as KVM starts with CR3 interception disabled. Obviously that can be fixed, but there's no good reason to play whack-a-mole, and it tends to end poorly, e.g. descriptor table exiting for UMIP emulation attempted to be precise in the past and ended up botching the interception toggling. Fixes: fe3ef05c7572 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()  (Sean Christopherson, 1 file changed, -23/+17)
Move the CR0/CR3/CR4 shenanigans for EPT without unrestricted guest back into vmx_set_cr0(). This will allow a future patch to eliminate the rather gross stuffing of vcpu->arch.cr0 in the paging transition cases by snapshotting the old CR0. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INIT  (Sean Christopherson, 1 file changed, -4/+1)
Remove a bogus write to vcpu->arch.cr0 that immediately precedes vmx_set_cr0() during vCPU RESET/INIT. For RESET, this is a nop since the "old" CR0 value is meaningless. But for INIT, if the vCPU is coming from paging enabled mode, crushing vcpu->arch.cr0 will cause the various is_paging() checks in vmx_set_cr0() to get false negatives. For the exit_lmode() case, the false negative is benign as vmx_set_efer() is called immediately after vmx_set_cr0(). For EPT without unrestricted guest, the false negative will cause KVM to unnecessarily run with CR3 load/store exiting. But again, this is benign, albeit sub-optimal. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guest  (Sean Christopherson, 1 file changed, -9/+5)
Opt-in to forcing CR0.WP=1 for shadow paging, and stop lying about WP being "always on" for unrestricted guest. In addition to making KVM a wee bit more honest, this paves the way for additional cleanup. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INIT  (Sean Christopherson, 1 file changed, -2/+1)
Drop unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT, as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip. No true functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Move EDX initialization at vCPU RESET to common code  (Sean Christopherson, 4 files changed, -24/+13)
Move the EDX initialization at vCPU RESET, which is now identical between VMX and SVM, into common code. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Consolidate APIC base RESET initialization code  (Sean Christopherson, 3 files changed, -18/+7)
Consolidate the APIC base RESET logic, which is currently spread out across both x86 and vendor code. For an in-kernel APIC, the vendor code is redundant. But for a userspace APIC, KVM relies on the vendor code to initialize vcpu->arch.apic_base. Hoist the vcpu->arch.apic_base initialization above the !apic check so that it applies to both flavors of APIC emulation, and delete the vendor code. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESET  (Sean Christopherson, 1 file changed, -9/+6)
Stuff vcpu->arch.apic_base and apic->base_address directly during APIC reset, as opposed to bouncing through kvm_set_apic_base() while fudging the ENABLE bit during creation to avoid the other, unwanted side effects. This is a step towards consolidating the APIC RESET logic across x86, VMX, and SVM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESET  (Sean Christopherson, 1 file changed, -6/+3)
Write vcpu->arch.apic_base directly instead of bouncing through kvm_set_apic_base(). This is a glorified nop, and is a step towards cleaning up the mess that is local APIC creation. When using an in-kernel APIC, kvm_create_lapic() explicitly sets vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state changes. That call during RESET exists purely to set apic->base_address to the default base value. As a result, by the time VMX gets control, the only missing piece is the BSP bit being set for the reset BSP. For a userspace APIC, there are no side effects to process (for the APIC). In both cases, the call to kvm_update_cpuid_runtime() is a nop because the vCPU hasn't yet been exposed to userspace, i.e. there can't be any CPUID entries. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by default  (Sean Christopherson, 1 file changed, -2/+5)
Set the BSP bit appropriately during local APIC "reset" instead of relying on vendor code to clean up at a later point. This is a step towards consolidating the local APIC, VMX, and SVM xAPIC initialization code. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
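For reference, the value stuffed into the APIC base MSR at RESET boils down to the following (a self-contained sketch using the architectural bit positions; the helper function itself is hypothetical):

    #define APIC_DEFAULT_PHYS_BASE      0xfee00000ULL
    #define MSR_IA32_APICBASE_BSP       (1ULL << 8)
    #define MSR_IA32_APICBASE_ENABLE    (1ULL << 11)

    /*
     * APIC base MSR value at vCPU RESET: default base, xAPIC enabled, and
     * the BSP bit set only for the vCPU that is the reset BSP.
     */
    static unsigned long long apic_base_at_reset(int is_reset_bsp)
    {
        unsigned long long base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;

        if (is_reset_bsp)
            base |= MSR_IA32_APICBASE_BSP;
        return base;
    }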
2021-08-02  KVM: x86: Don't force set BSP bit when local APIC is managed by userspace  (Sean Christopherson, 1 file changed, -3/+0)
Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is managed by userspace. Forcing all vCPUs to be BSPs is non-sensical, and was dead code when it was added by commit 97222cc83163 ("KVM: Emulate local APIC in kernel"). At the time, kvm_lapic_set_base() was invoked if and only if the local APIC was in-kernel (and it couldn't be called before the vCPU created its APIC). kvm_lapic_set_base() eventually gained generic usage, but the latent bug escaped notice because the only true consumer would be the guest itself in the form of an explicit RDMSRs on APs. Out of Linux, SeaBIOS, and EDK2/OVMF, only OVMF consumes the BSP bit from the APIC_BASE MSR. For the vast majority of usage in OVMF, BSP confusion would be benign. OVMF's BSP election upon SMI rendezvous might be broken, but practically no one runs KVM with an out-of-kernel local APIC, let alone does so while utilizing SMIs with OVMF. Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel") Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP  (Sean Christopherson, 1 file changed, -1/+2)
Make vcpu0 the arbitrary owner of the PIT, as was intended when PIT migration was added by commit 2f5997140f22 ("KVM: migrate PIT timer"). The PIT was unintentionally turned into being owned by the BSP by commit c5af89b68abb ("KVM: Introduce kvm_vcpu_is_bsp() function."), and was then unintentionally converted to a shared ownership model when kvm_vcpu_is_bsp() was modified to check the APIC base MSR instead of hardcoding vcpu0 as the BSP. Functionally, this just means the PIT's hrtimer is migrated less often. The real motivation is to remove the usage of kvm_vcpu_is_bsp(), so that more legacy/broken crud can be removed in a future patch. Fixes: 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: Remove defunct BSP "update" in local APIC reset  (Sean Christopherson, 1 file changed, -3/+1)
Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and confusing nop. When the code was originally added, kvm_vcpu_is_bsp() queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the BSP vCPU's APIC. But, stuffing the BSP bit at INIT was wrong since the guest can change its BSP(s); this was fixed by commit 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable"). In other words, kvm_vcpu_is_bsp() is now purely a reflection of vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always set the current value and kvm_lapic_set_base() is effectively a nop if the new and old values match. The RESET case, which does need to stuff the BSP for the reset vCPU, is handled by vendor code (though this will soon be moved to common code). No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: x86: WARN if the APIC map is dirty without an in-kernel local APIC  (Sean Christopherson, 1 file changed, -0/+3)
WARN if KVM ends up in a state where it thinks its APIC map needs to be recalculated, but KVM is not emulating the local APIC. This is mostly to document KVM's "rules" in order to provide clarity in future cleanups. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: SVM: Drop explicit MMU reset at RESET/INIT  (Sean Christopherson, 1 file changed, -1/+0)
Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the common x86 path correctly handles conditional MMU resets, e.g. if INIT arrives while the vCPU is in 64-bit mode. This reverts commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset"). Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02  KVM: VMX: Remove explicit MMU reset in enter_rmode()  (Sean Christopherson, 1 file changed, -2/+0)
Drop an explicit MMU reset when entering emulated real mode now that the vCPU INIT/RESET path correctly handles conditional MMU resets, e.g. if INIT arrives while the vCPU is in 64-bit mode. Note, while there are multiple other direct calls to vmx_set_cr0(), i.e. paths that change CR0 without invoking kvm_post_set_cr0(), only the INIT emulation can reach enter_rmode(). CLTS emulation only toggles CR0.TS, VM-Exit (and late VM-Fail) emulation cannot architecturally transition to Real Mode, and VM-Enter to Real Mode is possible if and only if Unrestricted Guest is enabled (exposed to L1). This effectively reverts commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real mode"). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>