Hosted documentation site

Some languages have hosted sites that help you find the packages you need.

  • Rust has docs.rs, which hosts documentation for every package uploaded to crates.io
  • Go has pkg.go.dev, which crawls the dependency trees of the code it knows about to find and generate documentation for new packages
  • Elm has package.elm-lang.org, which hosts the README of each package
  • TypeScript has the TypeScript types search, which links to libraries that have associated type definitions

The shards site is a great start, but it would be great if it could also host documentation for each shard. It seems like a server could run crystal docs on each new version of a shard and then publish the resulting HTML to a CDN. In terms of the infrastructure behind this, I can help build such a thing. What do folks think?


There were previous attempts at https://github.com/docrystal, but they stopped working at some point. It was an effort independent of the core team.

Some nice stories that are not yet covered:

  • use the right Crystal version to build the docs
  • documentation for multiple versions
  • highlight outdated versions
  • linking between shards
  • good SEO

I use https://github.com/bcardiff/ghshard to publish docs for multiple versions of crystal-db as a POC.


Thanks for that pointer @bcardiff. Would anybody be interested in continuing the work? As a newcomer, I know such a documentation site would be tremendously useful for me and I think it would be a really good resource for the community.

I have some ideas about how to do the processing with cheap infrastructure and the hosting for free.

I’ve already sketched some rough plans for such a service. In general it doesn’t seem too complicated to implement, but as @bcardiff said there are some intricate details.
My idea was to start working on this after 1.0 is released, in order to focus on the compiler/stdlib first.
crystal docs especially needs some improvements to be usable for such a feature.

But I guess it wouldn’t hurt to already start with the planning phase.


Would it be reasonable to start with a list of versions on the shards site, sidestepping doc generation for now? I’m not sure if the shards site already has this, but I couldn’t find it.

Not sure which site you’re talking about. There are several sites for discovering shards.

shardbox.org shows releases (I’m the author and provider). The latest ones are prominent on the main page for each shard and an extensive list is available on a sub-page (e.g. https://shardbox.org/shards/crinja/releases).
shards.info also shows releases, but only a few of the latest ones.

I was originally thinking of crystalshards.org, I didn’t know shardbox.org existed until now. It looks very nice. Since it already has release lists, this looks like a good place to generate documentation. Are you open to me opening an issue on one of your repos to start thinking about how to do it?

Actually, I’d probably prefer the discussion to be here. I don’t see doc generation as an integral part of the existing shards database. It should be deeply integrated, but a separate service. It can certainly live in the shardbox org on github and be hosted on shardbox.org if we want, but doesn’t belong in any of the existing repos.

I’m definitely interested in helping! I’ve been thinking about this for some time and I’d probably be able to provide free hosting and storage too.

Re: the repository, a new one sounds fine to me @straight-shoota. Do you mind sharing your rough plans? Also, @RX14, do you have any idea what the crystal docs improvements are?

I can dedicate some time during the week to work on this.

Awesome, let’s compare notes =)

I’ve already taken a look at similar projects, particularly https://docs.rs/ and https://godoc.org/ (although the latter is in the process of being replaced by https://pkg.go.dev/ - I’m not sure whether the internal implementation has changed).

As already mentioned, I wouldn’t couple it tightly with the shards database but some parts like discovery of new releases can be shared.
My general idea was to keep the external interface as small as possible. So the doc build server would just react to requests for documentation. A request contains the location of the repository and an optional git ref. When the server encounters a request for a repo it hasn’t seen before, and the docs are thus not readily available, it would fetch that repo and build the documentation. This should typically take only a few seconds. On the first request, the response would be empty, asking the client to try again after a short while.
The build delay can obviously be skipped by requesting the docs on the doc service beforehand, for example while running CI when releasing a new version. That should make the docs already available as the first real requests arrive.
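A rough sketch of that request flow in Python (DocServer, its cache, and the build queue are all hypothetical names, not an existing API):

```python
from dataclasses import dataclass, field

# Sketch of the doc server's request flow: serve from cache if built,
# otherwise enqueue a build and ask the client to retry shortly.
@dataclass
class DocServer:
    cache: dict = field(default_factory=dict)   # (repo, ref) -> rendered docs
    queue: list = field(default_factory=list)   # builds waiting to run

    def handle_request(self, repo: str, ref: str) -> dict:
        key = (repo, ref)
        if key in self.cache:
            return {"status": 200, "body": self.cache[key]}
        if key not in self.queue:
            self.queue.append(key)              # kick off a background build
        # Nothing available yet: ask the client to retry in a few seconds.
        return {"status": 202, "retry_after": 5}

    def finish_build(self, repo: str, ref: str, docs: str) -> None:
        self.queue.remove((repo, ref))
        self.cache[(repo, ref)] = docs
```

Pre-warming from CI would then just mean calling the same endpoint once when a release is tagged.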

I’d like the docs server to internally use repository URLs directly, instead of shard names as on shardbox.org. That gives more freedom and flexibility, because you can also build documentation for repositories not listed on shardbox, for development branches, and for mirrors.

This basic interface can obviously be wrapped by a more human-friendly frontend, for example using shard names from the shardbox database to identify repositories. Then you can ask for kemal docs instead of github.com/kemalcr/kemal docs.

I’m not entirely settled on the data format of the doc build. Storing the completely rendered doc generator output is easy to implement, and you just need to serve static HTML files. But it also comes with a lot of overhead: all the HTML files of the stdlib docs are about 80 MB unzipped, while the index.json (which also contains all the content, but is still not very optimized in terms of duplicates) is only 10 MB.
Storing only the content in a data structure and building the output on the fly offers more flexibility. Transparent HTTP caching can still be used to avoid rebuilding the output on every single request.

The core functionality should only be serving structured data (JSON): not just the entire content of each build’s index.json, but also individual namespaces selectively.

Generating the HTML frontend on top of that might actually be considered a separate project. At least it’s very easy to separate when you have a simple JSON API in between. The implementation would just fetch content from the docs server and present it as HTML. Probably similar to what crystal docs currently produces, but it could also be very different. Maybe even a Jamstack app that talks directly to the docs API and renders HTML on the client. We don’t need to worry about the specifics right now, because with a simple API it’s easy to integrate any web frontend.
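Serving a single namespace out of a full index could look roughly like this (the nested "types" tree here is an invented stand-in for whatever the cleaned-up index.json format ends up being):

```python
from typing import Optional

# Hypothetical shape: the index is a tree of {"name": ..., "doc": ...,
# "types": [children]} nodes, loosely modelled on today's index.json.
def find_namespace(index: dict, path: str) -> Optional[dict]:
    """Walk the nested type tree by a '::'-separated namespace path."""
    node = index
    for part in path.split("::"):
        children = {t["name"]: t for t in node.get("types", [])}
        if part not in children:
            return None
        node = children[part]
    return node
```

An API route like /api/&lt;repo&gt;/&lt;ref&gt;/namespace/HTTP::Client could then return just that subtree instead of the whole document.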

Regarding the build process itself, every build should obviously happen in a sandbox. I’m not yet sure about the best approach here, but a solution based on docker/runc would probably be relatively easy to implement.

For most shards, building docs probably shouldn’t be much more than checking out the source code and running crystal docs. But things can get more complicated when a shard needs extra dependencies for building its docs. A couple of standard tools should probably be available in every sandbox, but that probably won’t suffice for all cases. So for some shards we might need to add extra tools/dependencies to the sandbox. But that could be configured per repository.
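As a rough illustration of the docker-based approach, a single build could be wrapped like this (the image name, flags, and mount paths are placeholder examples, not a vetted hardening profile):

```python
import subprocess

def docker_build_command(repo_dir: str, out_dir: str,
                         image: str = "crystallang/crystal:latest") -> list:
    """Assemble one sandboxed `crystal docs` build as a `docker run` command.

    The limits below are illustrative only. Networking is disabled on the
    assumption that checkout and `shards install` already ran beforehand.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",
        "--memory", "1g",
        "--pids-limit", "256",
        "-v", f"{repo_dir}:/src",
        "-v", f"{out_dir}:/out",
        "-w", "/src",
        image,
        "crystal", "docs", "--output", "/out",
    ]

def run_build(repo_dir: str, out_dir: str) -> int:
    """Run the build and return the container's exit code."""
    return subprocess.run(docker_build_command(repo_dir, out_dir)).returncode
```

Per-repository extra tools could then simply mean swapping in a different image for that shard.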

That’s a few of my thoughts so far. I’ve some more detailed ideas, but that would go too deep for now.

Completely agreed on the general architecture - standalone docs server which takes GET /<git url>/<ref>/<docs path> and serves it. First request builds and caches the docs, and shows you a little progress screen and refreshes every few seconds. Can be preloaded by GETing that URL from other services. Shardbox integration for short URL aliases sounds great.

I’d push off this decision until later. Probably pre-gzipping pages with a custom gzip dictionary would do amazingly well at low technical cost (plus the result can be served straight from disk without processing). I’m not worried about storage usage at all. HDDs are cheap. Or Wasabi, if you want to do it in the cloud.
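To illustrate the size win of a preset dictionary, here is a zlib sketch in Python (note: the standard gzip HTTP content-encoding has no preset-dictionary support, so this only demonstrates the compression gain, not a drop-in serving scheme; the dictionary bytes are an invented sample of shared page boilerplate):

```python
import zlib

# Boilerplate that opens every generated docs page. Using it as a preset
# dictionary lets the compressor reference it instead of storing it again.
PAGE_DICT = (
    b'<!DOCTYPE html><html lang="en"><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="css/style.css"></head><body>'
)

def compress_page(html: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=PAGE_DICT)
    return c.compress(html) + c.flush()

def decompress_page(data: bytes) -> bytes:
    d = zlib.decompressobj(zdict=PAGE_DICT)
    return d.decompress(data) + d.flush()
```

Pages sharing lots of template markup compress noticeably smaller this way than with plain dictionary-less compression.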

Disagree, we should just serve the crystal docs generated HTML. Any improvements you can make should be pushed upstream. There’s no need to reinvent the wheel. The JSON will still be there for those who want a machine-readable API. I’d really, really like to keep this as simple and as dumb as possible.

Copying what carc.in does currently is fine. It’s neither of those.

We’ll cross those bridges when we see which shards fail to compile in practice. I’d be fine with just saying “well, you shouldn’t make your docs heavily rely on macros” in that case.


IMO the main issue with just serving the output of crystal docs is that older releases miss out on new features of the docs app, which I find really bad UX.
It also makes it harder to integrate and interconnect docs on the docs server. For example a really useful feature would be hyperlinks for data types defined by dependencies and obviously stdlib docs from everywhere.

I initially thought so, too. But carc.in is built on playpen for sandboxing, and that project is no longer actively maintained. That’s explicitly stated in its description, and the repo itself hasn’t been updated in four years. That might be somewhat acceptable for carc.in… I don’t know. But starting a new project on an unmaintained security component sounds like an awful plan.

A similar alternative would be bubblewrap. Forgot to mention that one.

I really don’t like the complexity implications of this for a first stab. If we want to do better in the future, that’s fine, but I don’t think it should be in the initial POC.

Ah, I know some properly hardened alternatives like firejail though.

Of course, a PoC can certainly work with the static crystal docs HTML first :+1:

We would need to properly define (and clean up) the data format produced for index.json anyway. It’s currently a real mess because it just outputs stuff as it was represented internally. It was a quick and easy implementation without putting much thought into the data architecture.

I’d also refactor crystal docs to use this data format as the internal interface between the parser and the output generator. That decoupling makes it easier to use alternative frontends, and also to pipe content from storage through the generator to build HTML. That would allow using a newer version of the docs generator to produce the user interface even when the original source code is not compatible with the new compiler version.

Yes please! If we add this ability to render new HTML for old docs we should make that an upstream feature.

Feedback on https://github.com/crystal-lang/crystal/issues/6947 would be welcome =)

Looks like I missed a lot! I’ll try to add as many comments as I can.

Awesome! I have experience in the Go ecosystem. The latter’s implementation has certainly changed, but it follows the same idea.

One idea is to have a single interface that shardbox can expose, like a stream of new shards and/or versions, and have the docs server consume it (the pkg.go.dev service does this). That would allow the docs site to pre-warm its storage, without requiring shard authors to add something to their CI.
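Consuming such a stream could be as simple as periodically diffing the feed against what has already been built (the feed shape below is invented for illustration):

```python
# Hypothetical consumer of a shardbox release feed: diff the feed against
# already-built (repo, version) pairs and return the build jobs to enqueue.
def new_build_jobs(feed: list, built: set) -> list:
    """feed: [{"repo": str, "version": str}, ...]; built: {(repo, version)}."""
    jobs = []
    for release in feed:
        key = (release["repo"], release["version"])
        if key not in built:
            jobs.append(key)
    return jobs
```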

+1 from me. This has worked very well in the Go world.

+1 as well. It also means we could serve JSON or other formats for IDE plugins.

I don’t understand this one. Do we need to compile the code in order to generate docs, or are you thinking we need to build binaries for some other reason?

This would be a great feature to build into a GitHub action or the equivalent in other CI/CD systems. It boils down to a simple curl docs.shardbox.org/github.com/myorg/myrepo/..., but having it packaged up in a simple tool would help tremendously.
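A minimal version of such a step could look like this in Python (the docs.shardbox.org host and URL scheme follow the proposal in this thread; nothing here is a real, existing endpoint):

```python
# Hypothetical CI pre-warm step: request the docs URL once so the build is
# ready before real visitors arrive. Host and URL scheme are assumptions.
from urllib.parse import quote

def prewarm_url(repo: str, ref: str, host: str = "docs.shardbox.org") -> str:
    """Build the docs URL for a repo and git ref."""
    return f"https://{host}/{quote(repo, safe='/.')}/{quote(ref)}/"

def prewarm(repo: str, ref: str) -> None:
    """Fire a single GET so the docs server starts building immediately."""
    import urllib.request
    urllib.request.urlopen(prewarm_url(repo, ref))
```

A GitHub Action wrapper would only need to pass the repository and tag from its environment into prewarm().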

Regarding the GET endpoint, I would love it if we could simplify it to GET /<git url> and make the docs server smart enough to generate docs for all git tags. I don’t know enough about crystal docs to know if that’s possible, though …

+1, but can you share what hosting environment you’re running this in?

Could we serve the JSON next to the HTML maybe? Something like docs.shardbox.org/github.com/$ORG/$REPO/$TAG.html and docs.shardbox.org/github.com/$ORG/$REPO/$TAG.json maybe?

I’m gathering that docs might use macros, which can present a security concern. Sorry I’m not more knowledgeable about this! Some other decently hardened options could be gVisor or Firecracker. The former is used to run Go code at play.golang.org, FWIW.

Thank you both for hearing out a newbie :grinning: . @RX14, correct me if I’m wrong, but it sounds like you’re thinking about the infrastructure part of this? I think that’s where I can do the most good at this point.

Can you or @straight-shoota let me know some details on how shardbox.org is running (maybe dokku on a cloud somewhere, not sure)? I’m totally willing to start writing some of the infrastructure code we’ll need to serve the API, generate docs, etc…

When macros are involved in the code used for doc generation, that allows arbitrary code execution.

However, maybe it would be an option to disable the run macro for doc generation. The API docs ideally shouldn’t depend on the environment anyway. So I don’t see much of an actual use case and it would help make the setup simpler. We’d still want some kind of sandboxing though to limit execution resources per build.
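On Linux, per-build execution limits could be sketched with resource.setrlimit applied in the child process before the build command starts (the specific limits below are arbitrary examples, and this is rlimits only, not full sandboxing):

```python
import resource
import subprocess
import sys

def _cap(limit_id: int, desired: int) -> None:
    """Lower the soft limit to `desired`, never exceeding the hard limit."""
    soft, hard = resource.getrlimit(limit_id)
    if hard != resource.RLIM_INFINITY:
        desired = min(desired, hard)
    resource.setrlimit(limit_id, (desired, hard))

def make_limiter(cpu_seconds: int = 300, mem_bytes: int = 2 * 1024**3):
    """Return a function that applies the limits in the child process."""
    def apply_limits():
        _cap(resource.RLIMIT_CPU, cpu_seconds)   # cap total CPU time
        _cap(resource.RLIMIT_AS, mem_bytes)      # cap address space
    return apply_limits

def run_limited(cmd: list) -> subprocess.CompletedProcess:
    # preexec_fn runs after fork, before exec, in the child process only.
    return subprocess.run(cmd, preexec_fn=make_limiter(),
                          capture_output=True, text=True)
```

A runaway build would then be killed by the kernel rather than starving the whole server.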

That’s not a feature of crystal docs but would need to be implemented in the docs server. I think the process should be: when the server is notified about a new repo, it initializes builds for all releases, not just the requested one. When docs for a new release of an already-known repo are requested, it should only build that one.
Docs for non-release git refs should only be built on request.
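That scheduling policy is easy to express as a small function (all names here are invented for illustration):

```python
# Sketch of the build-scheduling policy described above: a repo seen for the
# first time gets builds for all its releases; a known repo only builds the
# release (or non-release git ref) that was actually requested.
def builds_for_request(repo: str, requested_ref: str,
                       all_releases: list, known_repos: set) -> list:
    if repo not in known_repos:
        refs = list(all_releases)
        if requested_ref not in refs:
            refs.append(requested_ref)   # e.g. a branch or other git ref
        return [(repo, ref) for ref in refs]
    return [(repo, requested_ref)]
```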

Yes, shardbox.org is currently deployed via Dokku on a private Hetzner VPS. It’s built using the Dockerfile in shardbox-web. It’s a bit of a mess because I wanted to do a static build with Alpine, but some of the static libraries needed some extra convincing, and it still didn’t fully work out.
This setup has been working great so far. The app just started to accumulate memory over time, so I have a cron task to restart it daily. I figure that doesn’t hurt anyway.
One thing I’d like to change is splitting the worker (which periodically discovers new releases of registered shards) into a separate service. Currently, the worker is deployed as a second process alongside the web app, so pushing an update to the worker means also updating the web app. That’s not a big issue, but not ideal either.

For what it’s worth I’m going to be deploying a colocated server with 10GbE and probably 12TiB of storage (with room to upgrade) in the US soon. So hosting and storage space shouldn’t be an immediate concern :)

The current docs generator emits index.json as well as index.html. /<url>/<ref>/ would fetch index.html, while accessing /<url>/<ref>/index.json would provide access for automated tools. We can clean up that URL if we feel it necessary.