Most monitoring tools only tell you when something is already broken. But what if you could find issues before they become outages? I just published a deep-dive on using AIOps for proactive anomaly detection. This isn't just theory: it's a complete, hands-on tutorial with the working code you need to try it yourself. The stack:

  • Infrastructure: defined with modern IaC tools, Terraform and Terragrunt
  • Observability: instrumented with OpenTelemetry
  • Analysis: powered by AWS DevOps Guru

Goodbye Manual Monitoring: How AIOps Spots Problems Before You Do

2025/10/22 12:45

Limitations of Traditional Monitoring

The management of modern distributed applications has become increasingly complex. Using traditional monitoring tools, which rely mainly on manual analysis, is insufficient for ensuring the availability and performance demanded by microservices or serverless topologies.

One of the main problems with traditional monitoring is the high volume and variety of telemetry data generated by IT environments. This includes metrics, logs, and traces, which in an ideal world should be consolidated on a single monitoring dashboard to allow observation of the entire system. Another problem is static thresholds for alarms. Setting them too low will generate a high volume of false positives, while setting them too high will fail to detect significant performance degradation.
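The threshold trade-off can be illustrated with a minimal sketch. The latency values, the window size, and the z-score limit below are all hypothetical, and the rolling z-score is only a simple stand-in for "learning a baseline"; it is not how any particular AIOps product works.

```python
from statistics import mean, stdev


def static_alarm(samples, threshold_ms):
    """Return the indices of samples that breach a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold_ms]


def baseline_alarm(samples, window=20, z_limit=3.0):
    """Return the indices of samples that sit more than z_limit standard
    deviations above the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (samples[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Hypothetical latencies in milliseconds: a steady ~50 ms service
# that degrades to ~300 ms at index 30.
healthy = [48, 52, 50, 49, 51] * 6
degraded = [300, 320, 310]
latencies = healthy + degraded

print("threshold too high:", static_alarm(latencies, threshold_ms=1000))
print("threshold too low: ", len(static_alarm(latencies, threshold_ms=49)), "alerts")
print("baseline-relative: ", baseline_alarm(latencies))
```

With the threshold at 1000 ms the degradation goes unnoticed, and at 49 ms most healthy samples raise alerts. The baseline-relative check stays quiet during healthy traffic and first fires at index 30, where the slowdown begins.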

To solve these problems, organizations are shifting to an intelligent, automated, and predictive solution known as AIOps. Instead of relying on human operators to manually connect the dots, AIOps platforms are designed to ingest and analyze these vast datasets in real time.

In this article, we will learn how AIOps platforms perform proactive anomaly detection, their most fundamental capability, as well as root cause analysis, prediction, and alert generation.


The Technology Stack

The solution detailed in this article is a combination of three synergistic pillars:

  1. A managed AIOps platform that provides analytical intelligence. We will use AWS DevOps Guru, which is the core of our solution and acts as its "AIOps brain." AWS DevOps Guru is a managed service that leverages machine learning models built and trained by AWS experts. A key design principle is to make AIOps accessible to engineers without machine learning expertise. Its primary function is to detect operational issues or anomalies and produce high-level insights instead of a stream of raw, uncorrelated alerts. These insights include related log snippets, a detailed analysis with a possible root cause, and actionable steps to diagnose and remediate the issue.
  2. An open-standard observability framework, OpenTelemetry, that supplies high-quality telemetry data and provides a unified set of APIs, SDKs, and tools to generate, collect, and export it. Its importance lies in two principles: standardization and vendor neutrality. If we later want to switch to a different AIOps tool, we can simply redirect the telemetry stream.
  3. A Serverless Application that is an example of a modern and dynamic microservice topology.

The diagram below shows the complete architecture of the proposed telemetry pipeline.

Practical Implementation

It’s important to understand that AWS DevOps Guru does not collect any telemetry data itself. Instead, it is configured to monitor and continuously analyze the application's resources, which it identifies by specific tags.
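As an illustration of tag-based coverage, the sketch below builds the request that the DevOps Guru `UpdateResourceCollection` API expects. It is not taken from the article's repository, which manages DevOps Guru through Terraform modules; the tag key `devops-guru-app` and the value `aiops-demo` are hypothetical. DevOps Guru requires coverage tag keys to start with the `devops-guru-` prefix (in any letter case), which the helper checks.

```python
TAG_KEY_PREFIX = "devops-guru-"


def build_resource_collection(tag_key, tag_values):
    """Build the ResourceCollection payload for tag-based coverage."""
    if not tag_key.lower().startswith(TAG_KEY_PREFIX):
        raise ValueError(f"tag key must start with '{TAG_KEY_PREFIX}': {tag_key!r}")
    return {"Tags": [{"AppBoundaryKey": tag_key, "TagValues": list(tag_values)}]}


def enable_tag_based_coverage(tag_key, tag_values):
    """Ask DevOps Guru to analyze resources that carry the given tag.

    boto3 is imported lazily so the payload helper above can be used
    without AWS credentials or the SDK installed.
    """
    import boto3

    client = boto3.client("devops-guru")
    client.update_resource_collection(
        Action="ADD",
        ResourceCollection=build_resource_collection(tag_key, tag_values),
    )


# Hypothetical tag key and value, for illustration only.
print(build_resource_collection("devops-guru-app", ["aiops-demo"]))
```

Every resource of the application (Lambda function, API Gateway, DynamoDB table) would then need to carry the same tag so that DevOps Guru treats them as one application boundary.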

To make this concrete, this section provides a comprehensive guide to implementing the proposed solution, and the Experiment section later shows how to put it to the test. The following Git repository structure aligns with IaC best practices:

    .
    ├── demo
    │   ├── envs
    │   │   └── dev
    │   │       ├── env.hcl              # Environment-specific configuration that sets the environment name
    │   │       ├── api_gateway
    │   │       │   └── terragrunt.hcl
    │   │       ├── devopsguru
    │   │       │   └── terragrunt.hcl
    │   │       ├── dynamodb
    │   │       │   └── terragrunt.hcl
    │   │       ├── iam
    │   │       │   └── terragrunt.hcl
    │   │       └── serverless_app
    │   │           └── terragrunt.hcl
    │   └── project.hcl                  # Project-level configuration defining `app_name_prefix` and `project_name` used across all environments
    ├── root.hcl                         # Root Terragrunt configuration that generates AWS provider blocks and configures S3 backend
    ├── src
    │   ├── app.py                       # Lambda handler function with OpenTelemetry instrumentation
    │   ├── requirements.txt
    │   └── collector.yaml
    └── terraform
        └── modules                      # Infrastructure Modules
            ├── api_gateway
            ├── devopsguru
            ├── dynamodb
            └── iam

:::info This modular (Terragrunt) approach has the following benefits:

  • True environment isolation: each environment (dev, prod, etc.) has its own state, config, and outputs.
  • All major AWS resources (Lambda, API Gateway, DynamoDB, IAM, DevOps Guru) are reusable Terraform modules in terraform/modules/.
  • Easy to extend for new AWS services or environments with minimal duplication.

:::

:::tip The full repository can be found here: https://github.com/kirPoNik/aws-aiops-detection-with-guru

:::

The Lambda function (code in app.py) receives requests from API Gateway, generates a unique ID, and puts an item into the DynamoDB table. It also contains the logic to inject a "gray failure", which our experiment requires. See the code snippet with the key logic below:

    import os
    import time
    import random
    import boto3
    import uuid

    # --- CONFIGURATION FOR GRAY FAILURE SIMULATION ---
    # This environment variable acts as our feature flag for the experiment
    INJECT_LATENCY = os.environ.get("INJECT_LATENCY", "false").lower() == "true"
    MIN_LATENCY_MS = 150  # Minimum artificial latency in milliseconds
    MAX_LATENCY_MS = 500  # Maximum artificial latency in milliseconds

    def handler(event, context):
        """
        Handles requests and optionally injects a variable sleep
        to simulate performance degradation.
        """
        # This is the core logic for our "gray failure" simulation
        if INJECT_LATENCY:
            latency_seconds = random.randint(MIN_LATENCY_MS, MAX_LATENCY_MS) / 1000.0
            time.sleep(latency_seconds)

        # The function's primary business logic is to write an item to DynamoDB
        try:
            table.put_item(
                Item={
                    "id": str(uuid.uuid4()),
                    "created_at": int(time.time())
                }
            )
            # ... returns a successful response ...
        except Exception as e:
            # ... returns an error response ...
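The latency-injection logic can also be tried in isolation, without Lambda or DynamoDB. The sketch below is an assumption-laden rework of the snippet above rather than code from the repository: the helper `inject_latency` and its `sleep` and `rng` parameters are hypothetical additions that let the delay be recorded instead of actually slept.

```python
import random
import time

MIN_LATENCY_MS = 150  # Same bounds as in the handler excerpt
MAX_LATENCY_MS = 500


def inject_latency(enabled, sleep=time.sleep, rng=random):
    """Sleep for a random 150-500 ms when enabled; return the delay in seconds."""
    if not enabled:
        return 0.0
    delay = rng.randint(MIN_LATENCY_MS, MAX_LATENCY_MS) / 1000.0
    sleep(delay)
    return delay


# Record the delays instead of sleeping, using a seeded generator.
recorded = []
rng = random.Random(42)
delays = [inject_latency(True, sleep=recorded.append, rng=rng) for _ in range(1000)]

print("min delay (s):", min(delays))
print("max delay (s):", max(delays))
print("mean delay (s):", round(sum(delays) / len(delays), 3))
```

Because the delay is drawn uniformly between 150 and 500 ms, the average added latency is about 325 ms per request: noticeable against a fast baseline, yet small enough to stay under a generous static alarm threshold.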

The collector configuration (in collector.yaml) defines pipelines that send traces to AWS X-Ray and metrics to Amazon CloudWatch. See the key logic below:

    # This file configures the OTel Collector in the ADOT layer
    exporters:
      # Send trace data to AWS X-Ray
      awsxray:
      # Send metrics to CloudWatch using the Embedded Metric Format (EMF)
      awsemf:

    service:
      pipelines:
        # The pipeline for traces: receive data -> export to X-Ray
        traces:
          receivers: [otlp]
          exporters: [awsxray]
        # The pipeline for metrics: receive data -> export to CloudWatch
        metrics:
          receivers: [otlp]
          exporters: [awsemf]

Simulating Failure and Generating Insights

:::info The Experiment section

:::

Step 1: Deploy the Stack

In the demo/envs/dev directory, run the usual commands:

    terragrunt init --all
    terragrunt plan --all
    terragrunt apply --all

Grab the API endpoint from the output and save it.

    export API_URL=$(terragrunt output -json --all \
      | jq -r 'to_entries[] | select(.key | test("api_endpoint")) | .value.value')
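For readers without jq, the same filter can be sketched in Python. This assumes that `terragrunt output -json --all` prints one JSON object per unit, concatenated one after another; the sample output names and the URL below are hypothetical placeholders, not values from the repository.

```python
import json
import re


def find_outputs(raw, pattern):
    """Return the values of all outputs whose name matches `pattern`,
    scanning a stream of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(raw):
        # raw_decode does not skip leading whitespace, so do it here.
        if raw[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(raw, pos)
        for name, entry in obj.items():
            if re.search(pattern, name):
                values.append(entry["value"])
    return values


# Hypothetical output of two units, for illustration only.
sample = '''
{"table_name": {"sensitive": false, "type": "string", "value": "demo-table"}}
{"api_endpoint": {"sensitive": false, "type": "string", "value": "https://example.invalid/dev"}}
'''

print(find_outputs(sample, "api_endpoint"))
```

In practice the `raw` argument would come from the command's standard output, for example through `subprocess.run(..., capture_output=True, text=True).stdout`.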

:::tip You need to enable AWS DevOps Guru and then wait 15-90 minutes while it discovers your applications and resources

:::

Step 2: Establish a Baseline

DevOps Guru needs to learn what "normal" looks like. Let's give it some healthy traffic. We'll use hey, a simple load testing tool perfect for this job.

Run a light load for a few hours. This gives the ML models plenty of data to build a solid baseline.

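    # Run for 4 hours, rate-limited to 5 requests per second per worker
    # (hey applies -q to each worker and defaults to 50 workers; add -c 1 for 5 requests per second in total)
    hey -z 4h -q 5 -m POST "$API_URL"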

:::tip Use GNU Screen to run this in the background

:::

Step 3: Inject the Failure

Now for the fun part. We'll introduce our "gray failure" - a subtle slowdown that a simple threshold alarm would likely miss.

In demo/envs/dev/serverless_app/terragrunt.hcl, set INJECT_LATENCY to "true" in the Lambda function's environment variables:

    environment_variables = {
      TABLE_NAME                         = dependency.dynamodb.outputs.table_name
      AWS_LAMBDA_EXEC_WRAPPER            = "/opt/otel-instrument"
      OPENTELEMETRY_COLLECTOR_CONFIG_URI = "/var/task/collector.yaml"
      INJECT_LATENCY                     = "true" # <-- Change this to true
    }

Apply the change. This quick deployment is an important event that DevOps Guru will notice.

terragrunt apply --all

Step 4: Generate Bad Traffic

Run the same load test again. This time, every request will have that extra, variable delay.

    # Run for at least an hour to generate enough bad data
    hey -z 1h -q 5 -m POST "$API_URL"

Our app is now performing worse than its baseline. Let's see if DevOps Guru noticed.

After 30-60 minutes of bad traffic, an "insight" popped up in the DevOps Guru console.

This is the real value of AIOps. A standard CloudWatch alarm would have just said, "Latency is high." DevOps Guru said, "Latency is high, and it started right after you deployed this change."
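The idea of linking an anomaly to a recent change can be sketched in a few lines. This is a deliberately naive illustration with hypothetical event names and timestamps; it is not DevOps Guru's actual algorithm, which the service does not expose.

```python
from datetime import datetime, timedelta


def recent_changes(anomaly_start, events, window=timedelta(minutes=30)):
    """Return the (name, time) events that happened within `window`
    before the anomaly started, most recent first."""
    in_window = [
        (name, when)
        for name, when in events
        if timedelta(0) <= anomaly_start - when <= window
    ]
    return sorted(in_window, key=lambda item: item[1], reverse=True)


# Hypothetical change events and anomaly start time.
events = [
    ("deploy: INJECT_LATENCY=true", datetime(2025, 1, 1, 12, 0)),
    ("deploy: initial stack", datetime(2025, 1, 1, 6, 0)),
    ("config change after the anomaly", datetime(2025, 1, 1, 12, 30)),
]
anomaly_start = datetime(2025, 1, 1, 12, 10)

for name, when in recent_changes(anomaly_start, events):
    print(f"{name} ({anomaly_start - when} before the anomaly)")
```

Only the deployment that landed shortly before the latency shift is reported. Older changes and anything that happened after the anomaly began are ignored, which is the kind of narrowing that turns an alert into a lead.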

Conclusion

This experiment shows a clear path away from reactive firefighting. By pairing a standard observability framework like OpenTelemetry with an AIOps engine like AWS DevOps Guru, we can build systems that help us find and fix problems before they become disasters.

The big takeaway is correlation. The magic wasn't just spotting the latency spike; it was automatically linking it to the deployment. That's the jump from raw data to real insight.

The future of ops isn't about more dashboards. It's about fewer, smarter alerts that tell you what's wrong, why it's wrong, and how to fix it.

Resources

  • GitHub Repository: https://github.com/kirPoNik/aws-aiops-detection-with-guru
  • AWS DevOps Guru Official Page
  • OpenTelemetry Official Documentation
  • AWS Distro for OpenTelemetry (ADOT) for Lambda
  • hey - HTTP Load Generator
