Friday, February 26, 2010

Cisco UCS testing shows throughput constraints, one lab says

Cisco's Unified Computing System (UCS) architecture has severe bandwidth constraints that limit throughput to a fraction of what is promoted by the company, according to a testing lab. The lab also claims that UCS actually works against the necessary automation for a virtualized environment.

An anonymous Cisco competitor commissioned Tolly Group to perform the Cisco UCS testing. Tolly claims that while Cisco promotes UCS blades as having 40 Gbps of capacity, actual throughput enabled during testing was as low as 27 Gbps.

"In recent weeks, we have been asked by one of Cisco's former blade server partners to benchmark the network throughput of Cisco's UCS for an upcoming comparative report. The limitations of Cisco's external switch-based UCS architecture were eye-opening to say the least," Tolly Group founder Kevin Tolly wrote. The full report is scheduled for release Friday.

Analysts and engineers familiar with UCS agree that Cisco's Unified Computing architecture, which includes blades, chassis, fabric interconnect and extender, management software, and network adapters, constrains bandwidth in ways that technology from competitors may not, but they say the throughput enabled is perfectly suitable to support today's applications. They also say that competitors' products have other drawbacks.

"This is not a real-world instance [for testing]," said one engineer, who runs a Cisco shop but hasn't yet implemented UCS. "No one is pumping out 10 gigabits per second." The engineer's own shop is 85% virtualized, runs on a 4 gigabit uplink backbone and has never had a bottleneck problem, he said.

Cisco UCS testing: Where bandwidth constraints may be found

The Tolly test points to a number of throughput trouble spots in the unified computing architecture, starting with the idea that there are more servers than active uplinks in a chassis and that there is dependence on a clumsy external switch.

A single UCS chassis holds eight servers, each of which has its own 10 GbE converged network adapter, but the chassis' fabric extender has only four active uplinks to the top-of-rack fabric interconnect switch.
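
The arithmetic behind this claim can be sketched in a few lines. This is only an illustration using the figures quoted in the article (eight servers, 10 GbE per server, four active 10 GbE uplinks); the function and variable names are hypothetical and are not taken from the Tolly report or from Cisco.

```python
# Figures as quoted in the article; names are illustrative only.
SERVERS_PER_CHASSIS = 8
SERVER_NIC_GBPS = 10
ACTIVE_UPLINKS = 4
UPLINK_GBPS = 10


def oversubscription(servers, nic_gbps, uplinks, uplink_gbps):
    """Return (blade capacity, uplink capacity, ratio between the two)."""
    blade_capacity = servers * nic_gbps
    uplink_capacity = uplinks * uplink_gbps
    return blade_capacity, uplink_capacity, blade_capacity / uplink_capacity


blade, uplink, ratio = oversubscription(
    SERVERS_PER_CHASSIS, SERVER_NIC_GBPS, ACTIVE_UPLINKS, UPLINK_GBPS
)
print(f"Blade capacity:  {blade} Gbps")
print(f"Uplink capacity: {uplink} Gbps")
print(f"Oversubscription ratio: {ratio:.0f}:1")
```

On these numbers the blades could in theory offer twice the traffic the active uplinks can carry, which is the "cut in half" effect the report describes.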

"Even though the blades have a theoretical aggregate capacity of 80 Gbps, they all have to communicate to the top-of-rack UCS 6120 XP Fabric Interconnect switch via the maximum of four 10 GbE uplink ports of the UCS 2104 Fabric Extender," Tolly wrote. "The UCS chassis can accept a second fabric extender, but Cisco documentation makes it clear that this second unit is for failover only and cannot be used to provide 80 Gbps of uplink connectivity."

Is Cisco unified computing architecture static?

Once the throughput is essentially cut in half by "eight servers vying for four available uplinks," the system does not then automatically aggregate the 40 Gbps of bandwidth across the servers as needed, Tolly said in his report.

"Cisco documentation informs us that a given server is 'pinned' to a given uplink port. 'Pinned' as in static, can't move, can't change, maxed out," he wrote.

Having thumbed deeply through Cisco's UCS user guide, Tolly found the following quote halfway through: "The pinning determines which server traffic goes to which server port on the fabric interconnect. This pinning is fixed. You cannot modify it. As a result, you must consider the server location when you determine the appropriate allocation of bandwidth for a chassis."
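
To make the idea of fixed pinning concrete, here is a minimal sketch of a static slot-to-uplink mapping. The round-robin rule used below is an assumption for illustration only; the user guide quoted above says the pinning is fixed and depends on server location, but the exact slot assignments are not given in this article.

```python
from collections import defaultdict


def pin_slots(num_slots=8, num_uplinks=4):
    """Map each chassis slot to one uplink.

    ASSUMPTION: slots are spread round-robin across uplinks. The real
    mapping is whatever the vendor documentation specifies.
    """
    return {slot: (slot - 1) % num_uplinks + 1 for slot in range(1, num_slots + 1)}


def slots_per_uplink(pinning):
    """Group slots by the uplink they are pinned to."""
    groups = defaultdict(list)
    for slot, uplink in sorted(pinning.items()):
        groups[uplink].append(slot)
    return dict(groups)


def share_uplink(pinning, slot_a, slot_b):
    """True if two slots contend for the same uplink."""
    return pinning[slot_a] == pinning[slot_b]


pinning = pin_slots()
for uplink, slots in slots_per_uplink(pinning).items():
    print(f"uplink {uplink}: slots {slots}")
print("slots 1 and 5 share an uplink:", share_uplink(pinning, 1, 5))
print("slots 1 and 2 share an uplink:", share_uplink(pinning, 1, 2))
```

Because the mapping never changes, which slot a server sits in decides which other server it competes with, and that is why the guide says server location must be considered when allocating bandwidth.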

Not only does the idea of "pinning" go against the very idea of automated virtual machine migration, according to Tolly, it causes further bandwidth limitation.

Tolly first benchmarked the throughput between two physical servers "pinned" to different uplinks with positive results. Out of the maximum of 20 Gbps bi-directional, he measured 16.7 Gbps of application throughput, not counting lower-layer protocols.

But when the same test was conducted between two physical servers that were tied to the same uplink, there was contention between the two servers as traffic transited the top-of-rack switch.

"The aggregate throughput dropped to 9.10 Gbps out of a possible 20 Gbps," Tolly wrote. "Further tests, and common sense, showed that the best throughput would be achieved when two pairs of servers pinned to the four different links communicated 1 to 2, and 3 to 4. That got us about 36 Gbps. That's great, but what about your other four servers?"

When Tolly added two more real servers and requested additional bandwidth, there was significant degradation resulting from the contention.

"By the time we reached the limits of our test requesting an additional 20 Gbps of bandwidth, Cisco's total system throughput had dropped to just 27 Gbps," he wrote.

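The throughput figures quoted in this section are easier to compare as a share of the available bandwidth. The sketch below does that arithmetic. The 20 Gbps ceilings are stated in the report; the 40 Gbps ceiling used for the last two rows is an assumption based on the four 10 GbE uplinks, and "about 36 Gbps" is treated as exactly 36. The names are illustrative only.

```python
# (scenario, measured Gbps, ceiling Gbps)
# ASSUMPTION: the last two rows use the 40 Gbps uplink total as the ceiling.
RESULTS = [
    ("two servers, different uplinks", 16.7, 20.0),
    ("two servers, same uplink", 9.10, 20.0),
    ("two pairs across four uplinks", 36.0, 40.0),
    ("after adding two more servers", 27.0, 40.0),
]


def utilization_pct(measured_gbps, ceiling_gbps):
    """Measured throughput as a percentage of the ceiling, to one decimal."""
    return round(100.0 * measured_gbps / ceiling_gbps, 1)


for label, measured, ceiling in RESULTS:
    pct = utilization_pct(measured, ceiling)
    print(f"{label}: {measured} of {ceiling} Gbps ({pct}%)")
```

Read this way, the same-uplink case delivers less than half of the nominal bandwidth, while the best-case arrangement gets close to the uplink limit.
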
Will Cisco UCS bandwidth constraints become a real problem?

When Cisco first set out to design UCS, developers didn't anticipate the speed of virtualization uptake and its bandwidth requirements, said Joe Skorupa, research vice president at Gartner. For now, Cisco's UCS far exceeds the needs of most shops.

"As virtualization proceeds over the next 18 months, the bandwidth requirements will increase," he said, adding that eventually Cisco will have to address the problem.

The company would have done better to design around the problem from the start just to avoid these types of claims, Skorupa said. After all, the networking giant is the "new kid on the block," with a battle against competitors that will be exhausting. But he said it's likely that Cisco's next iteration of equipment will address the problem.

Cisco declined to comment on Tolly's testing results, saying: "Cisco did not authorize or actively participate in the Tolly Group test, therefore we cannot accept or validate its conclusions. Further, Cisco cannot comment on test reports that have been paid for by another vendor."

Tolly contacted Cisco for participation in the course of testing, but Cisco declined.

Problem With Network In Fakulti Perubatan Sg. Buluh CTC, IMMB and RC - Internet very slow! Can you fix it ASAP!

One of the challenges of being an ICT Project Manager for this project is the "internet is slow, can you fix it" phone call. Generally, the network is blamed first, but there are many layers that all need to be examined: desktop, network, server, storage, database, Active Directory, internet service provider, etc. For example, a complaint about email slowness can be caused by many factors.

After doing some research and analysis on the network configuration, I found two factors that contribute to the "internet is slow, can you fix it" problem: (a) the network configuration done by the CCNP from Datacraft was set up without proper planning; (b) the Juniper router configured by Mesiniaga was set up without an understanding of the overall UiTMNet environment.

Since this project is still under its defect liability period, I'll conduct a special meeting to resolve these problems as soon as possible.

The general approach I will use covers three domains. I begin by identifying and defining the problems from a user perspective. This helps to separate issues related to system performance from non-technical issues that amplify the technical issues and affect user perception of performance, e.g. training or improper usage of the application.

I'll use multiple subject matter experts to focus on the different domains, so that each domain is evaluated by someone with in-depth knowledge of it.

The three investigation domains and the key focus areas within each domain are:

Client
End User Observation & Interviews
Client device performance analysis
Device configuration & Log review
Device specification analysis per application vendor recommendations

Network
WAN link utilization
Device performance analysis
Device configuration & Log review
Packet Loss & Latency analysis
Traffic Analysis

Infrastructure
Server and Storage performance analysis
Device configuration & Log review
Service and Process performance analysis
Device specification analysis per application vendor recommendations
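
For the "Packet Loss & Latency analysis" item in the Network domain, a small sketch of how probe results could be summarized is shown here. It assumes round-trip times have already been collected (for example from ping output) into a list, with None marking a lost probe. The sample values are made up for illustration and are not measurements from this network, and the function name is hypothetical.

```python
import statistics


def summarize_probes(rtts_ms):
    """Summarize probe results.

    rtts_ms: round-trip times in milliseconds, None for a lost probe.
    """
    received = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    summary = {"sent": sent, "received": len(received), "loss_pct": loss_pct}
    if received:
        summary["min_ms"] = min(received)
        summary["median_ms"] = statistics.median(received)
        summary["max_ms"] = max(received)
    return summary


# ILLUSTRATIVE sample only, not real data from the site.
sample = [12.0, 15.0, None, 11.0, 240.0, 13.0, None, 14.0, 12.0, 16.0]
report = summarize_probes(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```

Looking at the median next to the maximum helps show whether slowness is constant or comes in spikes, which points to different causes (a saturated WAN link versus an intermittent device or routing problem).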

The findings from the assessment did not identify a "magic bullet" issue that caused performance issues, but instead identified multiple smaller issues that combined to impact system performance.

In my experience of troubleshooting complex IT systems, I've found that the comprehensive approach outlined above works very well. Hope to see the result soon...