VMware vRealize Operations Manager (vROps) Notes

VMware vRealize Operations Manager; non-specific (cliff) notes


Capacity Planning Types

Allocation Based

Planning for best performance
– Workloads will have enough resources to run fast and with full allocation

Capacity Based

Planning for highest density
– Workloads will run but may be overcommitted
– Suggested for Dev environment usage


Badges

General

  • Datastore and Network IO usable capacity numbers are estimated by the system, whereas CPU, memory, and disk space capacities are defined by actual hardware limits

Capacity Analysis Badge

  • Total capacity (or provisioned capacity) = Total amount of resources assigned to a workload: CPU, RAM, storage, etc.
  • Limit = User-defined capacity limit for a workload
  • Usable Capacity = Capacity left after taking total capacity and setting aside reservations for High Availability and buffers
  • Reservation = User-defined minimum capacity reserved for a workload
  • Entitlement = System-determined value between the Reservation (minimum) and the Limit (maximum)
  • Demand = Amount of capacity a workload is asking for right now
  • Usage = Amount of capacity a workload actually receives
  • Resource Contention = Demand > Usage; the workload needs (Demand) more resources than the system is currently providing (Usage)
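
The definitions above can be tied together in a small sketch. The function names, the flat HA and buffer percentages, and the sample numbers are illustrative assumptions, not vROps internals or API calls. In particular, vROps determines entitlement itself; the clamp below only illustrates that the value stays between the reservation and the limit.

```python
def usable_capacity(total, ha_reserve_pct, buffer_pct):
    """Capacity left after setting aside HA and buffer reservations.

    Assumption: both reservations are expressed as a percentage of total.
    """
    return total * (100 - ha_reserve_pct - buffer_pct) / 100


def clamp_entitlement(demand, reservation, limit):
    """Illustrative only: keep a value between reservation and limit."""
    return max(reservation, min(demand, limit))


def has_contention(demand, usage):
    """Resource contention: the workload asks for more than it receives."""
    return demand > usage


if __name__ == "__main__":
    # Hypothetical cluster: 100 GHz total, 20% held for HA, 10% buffer.
    print(usable_capacity(100.0, 20, 10))
    print(clamp_entitlement(demand=12.0, reservation=2.0, limit=8.0))
    print(has_contention(demand=6.0, usage=4.5))
```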

Stress Badge

  • Stress = When the demand of a workload is at or above 70% (user changeable) of the capacity
  • Recommended Size column = The amount of resources actually needed to run the current workload without contention
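
As a minimal sketch of the stress rule above: the function name and sample values are hypothetical, and the real badge is driven by policy settings rather than a single comparison.

```python
def is_stressed(demand, capacity, threshold_pct=70.0):
    """True when demand is at or above threshold_pct of capacity.

    threshold_pct defaults to the 70% mentioned in the notes and stands in
    for the user-changeable policy value.
    """
    return demand >= capacity * threshold_pct / 100


if __name__ == "__main__":
    # Hypothetical VM with 8 GB of memory capacity.
    for demand_gb in (4.0, 5.6, 7.5):
        print(demand_gb, is_stressed(demand_gb, capacity=8.0))
```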

Workload Badge

  • Workload = An object's near real-time demand for a resource vs. the actual capacity of that resource, OR a simple percentage of how much resource an object wants to use vs. how much it can actually get its hands on.
  • Workload is a near-instantaneous value, calculated over the past 5 minutes of activity
  • Reference the Workload badge video @ 4:30 for details on how to properly read IOPS bars for this badge
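
The "simple percentage" reading of workload can be sketched as follows. Averaging the samples from the last five minutes is an assumption made for illustration; the notes only say the value is calculated over that window.

```python
def workload_pct(demand_samples, capacity):
    """Demand as a percentage of capacity over a short window.

    demand_samples: demand readings from the recent window (assumed to be
    averaged; the exact aggregation vROps uses is not stated in the notes).
    """
    average_demand = sum(demand_samples) / len(demand_samples)
    return 100.0 * average_demand / capacity


if __name__ == "__main__":
    # Hypothetical CPU demand readings in MHz against 2000 MHz of capacity.
    print(workload_pct([900, 1100, 1000], capacity=2000))
```

A result above 100 would mean the object is asking for more than the resource can supply.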

Time Remaining Badge

  • Uses a logarithmic scale… not linear.
  • Uses the same policy and metrics as the Capacity badge, just displays them in a “Time Remaining” fashion
  • Provisioning Time Buffer = Amount of time it takes to procure and put a new host into service (default = 30 days)
  • Peak Consideration = Badge considers peak loads; if this is turned off, then it considers average loads
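
The two settings above can be illustrated with a short sketch. Both function names are hypothetical, and treating the provisioning buffer as a simple "act now" cutoff is an assumption; the notes do not describe how vROps applies it internally.

```python
from statistics import mean


def representative_demand(samples, consider_peak=True):
    """Pick the load figure to plan against: peak if enabled, else average."""
    return max(samples) if consider_peak else mean(samples)


def within_provisioning_buffer(days_remaining, buffer_days=30):
    """Assumed rule: flag when remaining time is no longer than the time
    needed to procure a new host and put it into service."""
    return days_remaining <= buffer_days


if __name__ == "__main__":
    samples = [10, 20, 60]  # hypothetical demand readings
    print(representative_demand(samples, consider_peak=True))
    print(representative_demand(samples, consider_peak=False))
    print(within_provisioning_buffer(days_remaining=21))
```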

Reclaimable Capacity Badge

  • Reclaimable Capacity = Amount of capacity that can be reclaimed without causing stress or performance degradation

Compliance Badge

  • vROps can do simple compliance checks for hosts and VMs using the vSphere hardening guides
  • For more “out of the box” or tailored compliance, use vRealize Configuration Manager

Anomalies Badge

We have alerts for this metric

  • Uses dynamic thresholds created over time
  • Channel or Normal = Area in between upper and lower dynamic thresholds (DT) for each metric, created over time
  • Anomalies = Metrics that pass above or below (outside) the dynamically created channel (normal)
  • Anomalies Score = Total number of monitored metrics vs. the number of metrics that are outside the dynamic threshold (channel) at a given time
    • The more metrics acting abnormally the higher the score goes
    • Lower is better
  • Click and drag to zoom in on Anomalies graph and the table below will only show anomalies for that time period
  • HT = Hard Threshold
  • DT = Dynamic Threshold
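
A minimal sketch of the channel and score ideas above. The channel bounds are given as fixed numbers here, whereas vROps learns them over time, and expressing the score as a percentage is an assumption; all names and values are hypothetical.

```python
def outside_channel(value, lower, upper):
    """An anomaly: the metric is above or below its normal channel."""
    return value < lower or value > upper


def anomaly_score(latest, channels):
    """Share of monitored metrics currently outside their channel (0-100).

    latest:   {metric_name: current_value}
    channels: {metric_name: (lower_dt, upper_dt)}
    """
    if not latest:
        return 0.0
    abnormal = sum(
        1 for name, value in latest.items()
        if outside_channel(value, *channels[name])
    )
    return 100.0 * abnormal / len(latest)


if __name__ == "__main__":
    channels = {"cpu": (20, 80), "mem": (30, 70), "net": (10, 60), "disk": (10, 90)}
    latest = {"cpu": 95, "mem": 50, "net": 5, "disk": 40}
    print(anomaly_score(latest, channels))  # cpu and net are outside the channel
```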

Fault Badge

  • Not much to say here, this badge just shows faults incoming from systems and collectors
  • There is a “cancel” button on the badge page that can be used to remove faults that have been fixed

Alerts

  • Alerts = One or more active symptoms across one or many objects
  • Symptom(s): Created based on any metric, badge, or object (cluster, host, VM, datastore, switch, OS, application, etc.)
  • One-click remediation options are available (need to research this further; see video for some details)
  • Ideas for alert usage:
    • Snapshots and their removal
    • Stress, workload, capacity and anomaly badge flags (once these badges are reined in and tweaked using policies)
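
The symptom/alert relationship, using the snapshot idea above, might look like the sketch below. The `Symptom` class, the `snapshot_age_days` metric name, and the 7-day threshold are all hypothetical and are not vROps object or metric names.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    name: str
    metric: str
    threshold: float

    def is_active(self, metrics):
        """Active when the object's metric exceeds the threshold."""
        return metrics.get(self.metric, 0.0) > self.threshold


def active_symptoms(symptoms, objects):
    """Return (object, symptom name) pairs; an alert would be raised
    when this list is non-empty."""
    return [
        (obj, symptom.name)
        for obj, metrics in objects.items()
        for symptom in symptoms
        if symptom.is_active(metrics)
    ]


if __name__ == "__main__":
    symptoms = [Symptom("Old snapshot", "snapshot_age_days", 7)]
    objects = {
        "vm-a": {"snapshot_age_days": 10},
        "vm-b": {"snapshot_age_days": 2},
    }
    print(active_symptoms(symptoms, objects))
```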

Architecture, Scalability & HA

  • Architecture Video (Node types, etc)

  • Node: a single instance of vROps

    • Each node contains all layers of the application.
    • As the need arises vROps should alert you to expand the cluster, either up or out
    • Nodes can be set up on Windows, Linux, or as a Linux VM Appliance (OVF – running on SLES)
      • At this time node OSs cannot be mixed and must be homogeneous throughout the vROps cluster (except for remote collectors)
  • Node Roles:

    • Master: Manages global data for the cluster, also assumes data node role
    • Master Replica: receives replica data from the master node; will take over in case of master failure
    • Data: contains the core analytics engine to process incoming data, determine dynamic thresholds, and perform capacity calculations
    • Remote Collector: collects data across high-latency links and sends data back to the cluster
      • Can run on OSs that are different from the cluster's data nodes
  • User interfaces:

    • Admin UI:
      • runs on all nodes and provides cluster management
      • only used when product UI is not available, during upgrade, or configuring first node
    • Product UI:
      • Runs on all nodes except remote collector nodes
  • High Availability:

    • Redundant copies of data are spread around nodes
    • Master node is replicated to “master replica” node
    • Best practice is to deploy the Master and Master Replica nodes on separate hardware
      • Perhaps replica should be at DR site
    • When a node fails, the secondary becomes primary and a new secondary is chosen and replicated to. Once the failed node is brought back online and data is synchronized back to it, it assumes its original role
    • Step by step how to set up HA video
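
The node OS rule from the architecture notes (one OS across the cluster, remote collectors excepted) can be expressed as a quick pre-deployment check. This is only a planning sketch; the role and OS strings are made-up labels, not values read from vROps.

```python
def mixed_cluster_oses(nodes):
    """Return the sorted OS names if cluster nodes mix operating systems,
    otherwise an empty list. Remote collectors are exempt from the rule.

    nodes: list of (name, role, os_name) tuples.
    """
    cluster_oses = {
        os_name for _, role, os_name in nodes if role != "remote_collector"
    }
    return sorted(cluster_oses) if len(cluster_oses) > 1 else []


if __name__ == "__main__":
    planned = [
        ("node1", "master", "appliance"),
        ("node2", "master_replica", "appliance"),
        ("node3", "data", "appliance"),
        ("rc1", "remote_collector", "windows"),
    ]
    print(mixed_cluster_oses(planned))
```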

Online Resources