Observing the Unseen: Tools for Software Infrastructure
Name That Service!
Log Detectives: Unmasking Downtime
Function Dysfunction
100

In Google Cloud, this service allows you to set up notifications and automatic responses when your applications and services experience issues, using metrics and uptime checks

Google Alerting and/or Uptime checks

100

This open-source engine streamlines data retrieval in our applications, offering real-time data sync, automatic class generation for GraphQL queries directly from our schema, and an Event API handy for tasks like sending welcome emails to freshly registered single sign-on users.

Hasura

100

error: [pip] http://r-shareasia-pelias-pip-svc.r-shareasia-ns.svc.cluster.local:4200/121.3962138/14.2821796: {"timeout":5000,"code":"ECONNABORTED","errno":"ETIME","retries":3}

What's the primary issue that causes the logs to trigger and which services are affected/having issues according to the logs?

Primary Issue: Pip, a Pelias subcomponent, is timing out because of a high volume of incoming requests.

Affected Services: Pelias
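
A minimal sketch of how the log line in this clue can be taken apart to reach that answer. It assumes the line always has the shape "level: [component] url: {json}", and the helper name parse_pip_error is hypothetical, not part of Pelias. The pattern also strips terminal color codes, whether they arrive as a real ESC character or flattened to "?".

```python
import json
import re

# Color codes such as ESC[31m, or the "?[31m" form they take when ESC is lost.
ANSI = re.compile(r"(?:\x1b|\?)\[\d+m")
# Assumed shape: "<level>: [<component>] <url>: <json payload>"
LINE = re.compile(
    r"^(?P<level>\w+): \[(?P<component>\w+)\] (?P<url>\S+): (?P<payload>\{.*\})$"
)


def parse_pip_error(line: str) -> dict:
    """Split a Pelias log line into level, component, url and decoded payload."""
    clean = ANSI.sub("", line.strip())
    match = LINE.match(clean)
    if match is None:
        raise ValueError("unrecognised log line")
    fields = match.groupdict()
    fields["payload"] = json.loads(fields["payload"])
    return fields


SAMPLE = (
    '?[31merror?[39m: [pip] '
    'http://r-shareasia-pelias-pip-svc.r-shareasia-ns.svc.cluster.local:4200'
    '/121.3962138/14.2821796: '
    '{"timeout":5000,"code":"ECONNABORTED","errno":"ETIME","retries":3}'
)

parsed = parse_pip_error(SAMPLE)
print(parsed["component"], parsed["payload"]["code"], parsed["payload"]["retries"])
```

The component in brackets names the failing subcomponent, while the payload shows a 5000 ms timeout that was still hit after 3 retries.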

100

Which service should you investigate first when you encounter an error message labeled KEYCLOAK_SERVER_ERROR with an error code of UNKNOWN while calling updateSSOUsername?

Keycloak

200

In Coda, to find helpful links and resources for troubleshooting when the system is down specifically for R, which page of the document should you consult?

Environment Services and Helpful Links (under Campgrounds doc > Client-Partnerts/Rappler/Web: CG x ShareAsia/ShareAsia Infra & Backend subpage)

200

In the world of serverless PostgreSQL solutions, what is this color-themed technology that offers flexible scaling, automated maintenance, and hassle-free database management?

This is where our database for CG, Travel Game and SSO is served.

Neon / NeonDB

200

{ "detail": { "error": { "http_exception": { "request": { "queryString": "", "path": "/realms/rappler/protocol/openid-connect/certs", "method": "GET", "port": 443, "host": "sso-dev.rappler.com", "responseTimeout": "ResponseTimeoutDefault", "secure": true, "requestHeaders": { "User-Agent": "hasura-graphql-engine/v2.28.0", "Content-Type": "application/json" } }, "message": "Response timeout", "type": "http_exception" } }, "message": null }, "timestamp": "2023-09-27T13:06:03.855+0000", "type": "jwk-refresh-log", "level": "critical" }

What's the primary issue that causes the logs to trigger and which services are affected/having issues according to the logs?

Main Problem: Hasura timed out while fetching the JSON Web Key (JWK) from the Keycloak server, so it could not refresh the key used to verify the provided JWT during REST API Authentication.

Affected Services: Keycloak and Hasura
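
A rough sketch of how the two affected services can be read off the log entry in this clue: the User-Agent identifies the caller and the request fields identify the endpoint that did not answer. The function name jwk_refresh_target is hypothetical, and the sketch assumes every jwk-refresh-log entry has the same nesting as the one shown.

```python
import json


def jwk_refresh_target(raw: str) -> dict:
    """Return who made the JWK request, which URL it hit, and how it failed."""
    entry = json.loads(raw)
    exc = entry["detail"]["error"]["http_exception"]
    req = exc["request"]
    scheme = "https" if req["secure"] else "http"
    return {
        "caller": req["requestHeaders"]["User-Agent"],
        "url": f"{scheme}://{req['host']}:{req['port']}{req['path']}",
        "failure": exc["message"],
        "level": entry["level"],
    }


# The log entry from the clue, unchanged apart from line breaks.
SAMPLE = """
{ "detail": { "error": { "http_exception": { "request": {
    "queryString": "", "path": "/realms/rappler/protocol/openid-connect/certs",
    "method": "GET", "port": 443, "host": "sso-dev.rappler.com",
    "responseTimeout": "ResponseTimeoutDefault", "secure": true,
    "requestHeaders": { "User-Agent": "hasura-graphql-engine/v2.28.0",
                        "Content-Type": "application/json" } },
  "message": "Response timeout", "type": "http_exception" } }, "message": null },
  "timestamp": "2023-09-27T13:06:03.855+0000",
  "type": "jwk-refresh-log", "level": "critical" }
"""

summary = jwk_refresh_target(SAMPLE)
print(summary["caller"], "->", summary["url"], ":", summary["failure"])
```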

200

What error message should you expect when you call the function moderateChannel with a value for the camp parameter that does not exist?

The error code for this is INVALID_ARGUMENT

INVALID_CAMP

300

This service is used to check the error rate your service is returning to requesting clients. One example of its visualizations shows our Kubernetes resource usage per deployment/service, which can be viewed via Cluster View and Workload View.

Google Monitoring Dashboard

300

In the world of container orchestration, this Google Cloud service simplifies the management of our applications, providing an efficient and scalable platform. From here, we can easily scale the compute power of our instances by adding replicas that share the request load with the existing instances.

Google Kubernetes Engine / GKE

300

{ "jsonPayload": { "curl_rc": "7", "message": "readiness probe failed", "timestamp": "2023-09-20T09:46:12+00:00" }, "resource": { "type": "k8s_container", "labels": { "project_id": "lighthouse-sandbox", "pod_name": "r-shareasia-pelias-elasticsearch-es-default-0", "cluster_name": "r-shareasia-cluster", "location": "asia-southeast1", "namespace_name": "r-shareasia-ns", "container_name": "elasticsearch" } } }

What's the primary issue that causes the logs to trigger and which services are affected/having issues according to the logs?

Main Problem: Elasticsearch is currently unprepared to handle the forwarded requests originating from Pelias.

Affected Services: Elasticsearch and Pelias
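
A small sketch of how the log entry in this clue can be summarised. It assumes curl_rc holds the exit status of a curl call made by the readiness probe; the three exit-code meanings listed are standard curl ones, while the function name summarise_probe_failure is hypothetical.

```python
import json

# A few standard curl exit codes; curl_rc is assumed to be one of these.
CURL_EXIT_CODES = {
    "6": "could not resolve host",
    "7": "failed to connect to host",
    "28": "operation timed out",
}


def summarise_probe_failure(raw: str) -> str:
    """Build a one-line summary: namespace/pod (container): message, curl reason."""
    entry = json.loads(raw)
    payload = entry["jsonPayload"]
    labels = entry["resource"]["labels"]
    reason = CURL_EXIT_CODES.get(payload["curl_rc"], "unknown curl exit code")
    return (
        f'{labels["namespace_name"]}/{labels["pod_name"]} '
        f'({labels["container_name"]}): {payload["message"]}, curl: {reason}'
    )


# The log entry from the clue, unchanged apart from line breaks.
SAMPLE = """
{ "jsonPayload": { "curl_rc": "7", "message": "readiness probe failed",
                   "timestamp": "2023-09-20T09:46:12+00:00" },
  "resource": { "type": "k8s_container", "labels": {
      "project_id": "lighthouse-sandbox",
      "pod_name": "r-shareasia-pelias-elasticsearch-es-default-0",
      "cluster_name": "r-shareasia-cluster", "location": "asia-southeast1",
      "namespace_name": "r-shareasia-ns", "container_name": "elasticsearch" } } }
"""

print(summarise_probe_failure(SAMPLE))
```

Under that assumption, exit code 7 means the probe could not open a connection to Elasticsearch at all, which fits a pod that is not ready to take the requests Pelias forwards to it.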

300

What error message should you expect when you call the function moderateChannel with an account that is neither a moderator nor an admin of the camp?

The error code for this is PERMISSION_DENIED

USER_NOT_A_MODERATOR

400

This service visualizes the internal resource usage of our Homeserver (the process and the machine) in the form of graphs. Name the service, and give the usual subdomain it gets whenever we deploy a new instance of Homeserver.

Grafana

stats - stats.dev.campin.gg

400

This cloud platform specializes in deploying web applications and static websites with lightning-fast performance, and it's named after the creator's pet cat. It is built by the developers of Next.js, an open-source web development framework for building React-based web applications.

This is where the Campgrounds Web Client is hosted

Vercel

400

{ "jsonPayload": { "level": "warn", "type": "pg-client", "detail": { "message": "postgres connection failed, retrying(0)." }, "timestamp": "2023-10-02T11:35:33.001+0000" }, "resource": { "type": "k8s_container", "labels": { "location": "asia-southeast1", "pod_name": "r-shareasia-hasura-7845b9d5bc-286bl", "container_name": "hasura", "cluster_name": "r-shareasia-cluster", "project_id": "lighthouse-sandbox", "namespace_name": "r-shareasia-ns" } } }

What's the primary issue that causes the logs to trigger and which services are affected/having issues according to the logs?

Main Problem: Hasura encountered an issue while trying to establish a connection with the PostgreSQL endpoint

Affected Services: NeonDB and Hasura

400

What error message should you expect when you call the function updateSSOUsername with a username that is already in use, or when the requesting user already has a username registered in the system?

The error code for this is ALREADY_EXISTS

USERNAME_ALREADY_TAKEN_OR_USER_ALREADY_UPDATED_USERNAME

500


Judging from the provided user interface, where do you believe you can access this dashboard without needing to consult our list of observability (o11y) links in Coda?

Google Compute Engine (Machine instance of CG Homeserver) 

500

This open-source geocoding engine helps users find specific locations by converting addresses into geographic coordinates. It is free of charge (except for the infra cost of provisioning this service).

This is where we can search for POIs, Travel Goals and Quests while we're interacting with ShareAsia's map 

Pelias

500

POST https://r-shareasia-pelias-elasticsearch-es-internal-http.r-shareasia-ns.svc.cluster.local:9200/pelias/_search?search_type=dfs_query_then_fetch => Client network socket disconnected before secure TLS connection was established

What's the primary issue that causes the logs to trigger and which services are affected/having issues according to the logs?

Main Problem: Pelias was unable to establish a connection with Elasticsearch because it exceeded the maximum number of socket connections on the machine, primarily due to an influx of excessive requests.

Affected Services: Pelias

500

What error message should you expect when you call the function updateSSOUsername with an idToken that has already expired or been revoked and can no longer be used for authentication?

The error code for this is PERMISSION_DENIED

TOKEN_ALREADY_EXPIRED
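
The Function Dysfunction answers can be collected into one lookup table. This is only a study aid built from the function names, error messages and error codes quoted on this board; the names EXPECTED_ERRORS and expected_code are hypothetical and do not come from the codebase.

```python
from typing import Optional

# (function, error message) -> error code, exactly as quoted in the clues.
EXPECTED_ERRORS = {
    ("moderateChannel", "INVALID_CAMP"): "INVALID_ARGUMENT",
    ("moderateChannel", "USER_NOT_A_MODERATOR"): "PERMISSION_DENIED",
    (
        "updateSSOUsername",
        "USERNAME_ALREADY_TAKEN_OR_USER_ALREADY_UPDATED_USERNAME",
    ): "ALREADY_EXISTS",
    ("updateSSOUsername", "TOKEN_ALREADY_EXPIRED"): "PERMISSION_DENIED",
    ("updateSSOUsername", "KEYCLOAK_SERVER_ERROR"): "UNKNOWN",
}


def expected_code(function: str, message: str) -> Optional[str]:
    """Return the error code this board pairs with a function and message."""
    return EXPECTED_ERRORS.get((function, message))


print(expected_code("moderateChannel", "INVALID_CAMP"))
```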