Intro
Making a career change has given me time to reflect on the last 3–4 years I’ve spent working with HCX. This tool isn’t just useful; it’s the backbone of what AVS has become. Without HCX, Azure VMware Solution would be a far tougher sell, because for the average customer, migrations would be like trying to move a grand piano up a spiral staircase… blindfolded… with one hand tied behind your back.
So, as I looked back, I realized it was time to put together a collection of tips, tricks, and hard-earned lessons. This post will be a bit sporadic; think of it as a campfire chat for those who work with HCX regularly. In many ways, it’s my farewell letter to HCX as I shift my focus to training on Pure Storage products. HCX and I became so close that, as I write this, I feel like I’m writing her a note from the front lines at Normandy, saying I’ll likely never return. Hopefully, some of these insights will help someone, someday, push past a sticky situation.
The CLI
If you are diving into the CLI of HCX, you probably aren’t in a good place. But hey, that’s what this blog post is about. To get started, SSH needs to be enabled in the admin interface of HCX (https://HCX-IP:9443), and you log in with the admin account. The CLI can give you some much-needed information for the following items:
- Service Status: This can be helpful because the boot-up for the 9443 service is extremely slow. That first appliance boot-up can take every bit of 10–15 minutes before you can get into the admin interface and begin your Connector config. One thing you can do while you wait is make sure it didn’t hang: run service status management to see whether the service is actively running or not.
- Log Files: While you’re in the CLI, you can navigate to the log files. To get there, you need to enter a shell. After logging in, enter su -, then enter your password again. You can find the log files under common/logs/admin/. If you think the service status is hanging, check out app.log. This will be the primary file you investigate if you’re running into odd errors. (A quick example of both of these checks follows below.)
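Putting those two together, here’s roughly the session I’d run while waiting on a slow first boot. The log path is from memory and can shift between HCX versions, so treat it as a starting point rather than gospel:

# From the HCX admin CLI (logged in over SSH as admin)
service status management          # confirm the 9443 management service is actually running

# Drop to a root shell to reach the log files
su -
# (re-enter the admin password when prompted)

# Follow the admin service log while the appliance finishes booting
tail -f /common/logs/admin/app.log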

- Access to Service Mesh Appliances: This is probably the most important one, usually because we have to prove to the networking team that a route isn’t in place or a firewall rule is breaking things. From the CLI we can get into ccli, then run list to see all of our appliances. We use the go command followed by the appliance number to access the specific appliance. From there, we can drop into a shell by typing ssh, and now we can use our standard Linux commands to troubleshoot (a sketch of a typical session follows this item). Some things I used throughout my career include:
  - tcpdump -ni any port 31031 to check if replication traffic for bulk migrations is happening. I wrote about this here.
  - ping to see if we could reach the vMotion network for HCX vMotions.
  - Pinging the receiver appliances to verify that the service mesh appliances can communicate.
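For reference, a typical session looked roughly like this. The appliance index shown by list varies per service mesh, so the 0 below is just an example, and the ping target is a placeholder:

ccli                          # enter the appliance CLI from the HCX shell
list                          # show all service mesh appliances and their index numbers
go 0                          # attach to the appliance you care about (index from list)
ssh                           # drop into a Linux shell on that appliance
tcpdump -ni any port 31031    # watch for bulk migration replication traffic
ping <remote-vmotion-ip>      # confirm the vMotion network is reachable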
- vMotion Failures: vMotion failures are usually caused by one of two things—either the connection to Azure is slower than a 56k modem during a storm, or the IX appliance cannot reach the local vMotion interface. We’ll talk about troubleshooting WAN/Interconnect performance in a moment, but how can we check if it’s a local connectivity issue? Drop into a shell on the IX appliance and see if you can ping all of the vMotion interfaces. If you have a lot, you could use something like the script below to quickly check.
#!/bin/bash
# Define the IP range to sweep (adjust to match your vMotion subnet)
START=11
END=18
BASE_IP="10.10.10"

# Loop through the IPs, ping each one once, and report any that don't answer
for i in $(seq "$START" "$END"); do
    IP="$BASE_IP.$i"
    if ! ping -c 1 -W 1 "$IP" > /dev/null 2>&1; then
        echo "No response from $IP"
    fi
done
- Performance Testing: This has changed over different software versions. You can now get this information under the Transport Analytics section, though I still don’t fully understand what those numbers actually show. When you run a perftest at the IX or NE level, you have a number of options to choose from; site, ipsec, uplink, and all are the ones I used the most (see the sketch just after this list for how I’d run them). While researching these again, I found some interesting ones I haven’t even tried. I typically used site to demonstrate basic site-to-site performance, but I relied on ipsec as the real number for planning. If site and ipsec are massively off, that’s usually a sign of a network issue that requires further investigation. The cause varied from L7 firewall inspection to tunnel within tunnel (within tunnel (within tunnel))… or sometimes, the outbound interfaces for the appliances were sitting on a low-end 1GbE management switch. (Where do your uplinks sit?) Use this as a tool to figure out what’s going on.
- The User Guide: I still refer to this user guide (https://hcx.design/wp-content/uploads/2019/11/old_vmware-hcx-ccli-userguide.pdf) even though it’s old. Since VMware moved to Broadcom’s new systems, documentation has become harder to work with.
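To make that concrete, here’s roughly how I’d kick those tests off once attached to the IX (or NE) appliance in ccli. The option names are the ones I remember using; the available tests have changed across releases, so check the built-in help on your version:

go 0              # attach to the IX appliance (index from the list command)
perftest site     # basic site-to-site throughput between the local and remote appliances
perftest ipsec    # throughput through the encrypted tunnel; the number I planned against
perftest uplink   # raw uplink performance
perftest all      # run the full battery of tests (this takes a while)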
On The Topic of Network Performance
Cloud migrations: a journey that often begins with optimism and ends with a deep, personal relationship with your network team. Spoiler alert: that relationship either comes out strengthened, like surviving a successful combat operation together, or ends in something resembling a nasty divorce. Successfully deploying production workloads in the cloud requires extensive network planning and a deep understanding of routing and firewall configurations. This process often highlights the varying levels of expertise and engagement within networking teams. Some teams demonstrate strong problem-solving skills and effectively address complex network challenges, while others approach cloud networking with the confidence of a fish attempting to climb a tree.
So, stepping off my soapbox, what can I share here? A couple of thoughts:
Use Global Reach: This section’s relevance will likely diminish as AVS evolves. However, for now, your security team might observe the topology view and discover that the Global Reach configuration bypasses the firewalls they’ve established in their hub design. They may opt to forgo Global Reach, forcing traffic through the firewall for centralized inspection. This is a valid approach post-migration. During the migration phase, the combined load of Network Extension and IX traffic can easily overwhelm firewalls, bottlenecking both migration speed and Network Extension performance. Remember, on-premises inspection still provides a security checkpoint. Global Reach ensures the most direct and fastest path for migration purposes.
How do you troubleshoot networking issues in your company?: If the answer is ‘turn it off and on again,’ may the networking gods have mercy on your soul. The most effective approach, by far, involves rolling up your sleeves and using tools like iperf for Linux and NTTTCP for Windows. These tools let you stand up test systems across various source and destination points, pinpointing the location of the issue. Frequently, customers would attribute network problems to the AVS NSX-T environment. However, after conducting tests between on-premises and AVS, AVS and Azure, and Azure and on-premises, the root cause often turned out to be a WAN issue. I’ve encountered everything from faulty ISP SFPs to customer network misconfigurations. The key takeaway here: when faced with dismal network performance, test across diverse source and destination pairs to isolate the actual problem. Sometimes you may have to narrow your sources and destinations down to specific physical switches or routers to root out the dragons.
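As a reference point, this is the kind of minimal test I’d run between a pair of test VMs (say, one on-premises and one in AVS). It assumes iperf3 is installed on both Linux ends, the IPs are placeholders, and the NTTTCP flags for the Windows equivalent are from memory, so verify them against Microsoft’s documentation:

# On the AVS-side test VM: start an iperf3 server
iperf3 -s

# On the on-premises test VM: run a 30-second test with 8 parallel streams
iperf3 -c <avs-test-vm-ip> -P 8 -t 30

# Rough Windows equivalent with NTTTCP (start the receiver first, then the sender):
#   ntttcp.exe -r -m 8,*,<receiver-ip> -t 30
#   ntttcp.exe -s -m 8,*,<receiver-ip> -t 30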
Network Extension Performance: Network Extension is a highly sought-after feature of HCX, as it accelerates migrations by creating a networking ‘bridge’ to Azure VMware Solution (and other multi-cloud VMware offerings). However, as I discussed in my HCX Network Planning blog, certain maximums must be considered. One potential issue is an overrun Network Extension (NE), requiring you to add NEs to balance the load. How do you identify an overrun NE? According to https://configmax.broadcom.com, the ‘advertised’ total throughput per appliance ranges from 4 to 6+ Gbps, or 850 Mbps to 1.65 Gbps per flow. However, monitoring the virtual appliance’s statistics can provide valuable insight into system load. One of the most effective metrics is CPU usage, accessible via VM -> Monitor -> Performance -> Advanced. If NE HA is enabled, you can determine the active appliance under the Network HA screen.



In general, if you are seeing CPU usage consistently at 60%+, it may be time to add another appliance and move some L2Es over, or simply use the new NE appliance for any new L2Es going forward.
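If you’d rather not click through the vCenter UI for every NE appliance, a tool like govc can pull the same CPU counter from a shell. This is only a sketch: it assumes govc is installed with the usual GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD environment variables set, and the VM path is a placeholder for your NE appliance:

# Pull recent real-time samples of the CPU usage counter for an NE appliance VM
# (the inventory path below is a placeholder; adjust it to your environment)
govc metric.sample -n 12 vm/hcx-net-ext-a1 cpu.usage.average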
Migration Tips – Switchover Window
One of HCX’s strengths is its ‘Swiss Army knife’ approach to migration. Unable to extend Layer 2 for any reason? No problem; bulk migrations can modify interface IPs during the process. Need minimal or transparent downtime? We have vMotions with L2Es. In practice, you’ll likely use a combination of these methods to migrate your VMs. Depending on your migration waves, you may want to utilize the Switchover Window with Bulk Migrations. This setting initiates VM replication immediately and maintains a continuous delta replication state until you’re ready to migrate. A useful strategy I’ve found is to set your Switchover Window further out than your intended migration wave. This allows you to adjust the Switchover Window when you’re ready to proceed, as plans can change. Migration waves can be delayed for numerous business reasons; this flexibility prevents unwanted outcomes like unexpected switchovers because someone forgot to update the new migration wave date.




Using the Disaster Recovery Feature
While bulk migrations and vMotions are powerful features, they aren’t infallible. Failed switchovers can occur, often due to network issues or outdated VMware Tools. A significant challenge with failed migrations is the absence of a true seed point for HCX to resume a migration wave. While this isn’t problematic for a few sub-1TB VMs, it can be extremely painful for 30, 60, or 80 TB VMs, where restarting the process means a very lengthy wait.

To mitigate failures with these ‘large’ VMs, consider using HCX’s Disaster Recovery feature. Although I haven’t discussed this feature extensively, as it’s less commonly used for actual DR purposes, it’s highly beneficial in scenarios involving very large VMs. HCX Disaster Recovery lets us establish VM replications. At first glance, it resembles bulk migration: the VM replicates, progress metrics are displayed, and it enters a delta sync state while awaiting migration. However, the ‘Test Recovery’ feature provides a distinct advantage for large VMs. It lets us create a clone from the replicated data while the VM continues to replicate in the background. Once the clone is spun up with disconnected vNICs, we can verify its state via the vCenter console, manually shut down the source VM, and connect the recovered ‘test’ VM to a new network. Any failures during this process don’t interrupt the protected VM’s ongoing delta sync. This offers far more flexibility than the strict ‘failover window’ of the HCX migration screen. It isn’t necessary for every VM, but it’s invaluable for the exceptionally large ones.




Wrap Up
I’m certain I’ve overlooked some scenarios. I regret not keeping more detailed notes from all the troubleshooting calls I participated in as an Azure VMware Solution CSA. It was a rewarding experience, and the expanded networking exposure significantly contributed to my career growth. Did I miss anything? Please leave a comment or tag me on my LinkedIn post. The purpose of sharing these experiences is to assist others during critical moments!