mox/metrics/panic.go

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var metricPanic = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mox_panic_total",
		Help: "Number of unhandled panics, by package.",
	},
	[]string{
		"pkg",
	},
)

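// Panic is the value of the "pkg" label on mox_panic_total. It names the
// package or subsystem in which a panic occurred.
type Panic string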

const (
	Ctl              Panic = "ctl"
	Import           Panic = "import"
	Serve            Panic = "serve"
	Imapserver       Panic = "imapserver"
	Dmarcdb          Panic = "dmarcdb"
	Mtastsdb         Panic = "mtastsdb"
	Queue            Panic = "queue"
	Smtpclient       Panic = "smtpclient"
	Smtpserver       Panic = "smtpserver"
	Tlsrptdb         Panic = "tlsrptdb"
	Dkimverify       Panic = "dkimverify"
	Spfverify        Panic = "spfverify"
	Upgradethreads   Panic = "upgradethreads"
	Importmanage     Panic = "importmanage"
	Importmessages   Panic = "importmessages"
	Store            Panic = "store"
	Webadmin         Panic = "webadmin"
	Webapi           Panic = "webapi"
	Webmailsendevent Panic = "webmailsendevent"
	Webmail          Panic = "webmail"
	Webmailrequest   Panic = "webmailrequest"
	Webmailquery     Panic = "webmailquery"
	Webmailhandle    Panic = "webmailhandle"
)

func init() {
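	// Ensure the panic counts are initialized to 0 for each possible label value, so
	// the query for change also picks up the first panic. increase() and rate() do
	// not appear to assume a previous value of 0 when a label gets its first value.
	// Every Panic constant declared above must be listed here.
	names := []Panic{
		Ctl,
		Import,
		Serve,
		Imapserver,
		Dmarcdb,
		Mtastsdb,
		Queue,
		Smtpclient,
		Smtpserver,
		Tlsrptdb,
		Dkimverify,
		Spfverify,
		Upgradethreads,
		Importmanage,
		Importmessages,
		Store,
		Webadmin,
		Webapi,
		Webmailsendevent,
		Webmail,
		Webmailrequest,
		Webmailquery,
		Webmailhandle,
	}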
	for _, name := range names {
		metricPanic.WithLabelValues(string(name)).Add(0)
	}
}

// PanicInc increments the mox_panic_total counter for the given label value.
func PanicInc(name Panic) {
	metricPanic.WithLabelValues(string(name)).Inc()
}