Hosted documentation site

One solution would be to run crystal docs in a sandboxed environment. No HTTP connections allowed, just reading and writing from the local project directory.
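
For example (just a sketch; the image name, mount point and resource limits are placeholders), the build could be driven from Crystal but executed inside a network-less container:

```crystal
# Rough sketch: run `crystal docs` inside a container with networking disabled.
project_dir = ARGV[0]? || Dir.current

status = Process.run("docker", [
  "run", "--rm",
  "--network", "none",            # no HTTP connections from inside the build
  "-v", "#{project_dir}:/work",   # only the project directory is visible
  "-w", "/work",
  "crystallang/crystal:latest",
  "crystal", "docs",
], output: STDOUT, error: STDERR)

puts "docs build failed" unless status.success?
```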

Yes, that's the general plan with the current behaviour of the docs generator.
But my idea was to maybe change the compiler to not allow the run macro (or anything else that could do damage) when generating docs. This would help make sure that when you can build docs locally, they also build on the docs server. It would also incentivise developers to write better docs, because the documented API should really not be influenced by anything except the code that's in the repository.

I think macro run can create types and methods. If that's the case, and if they are referenced elsewhere, crystal docs will just not work. Maybe not a big deal.
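
For context, the run macro compiles and executes a program at compile time and pastes whatever it prints into the calling file, so it can absolutely introduce public types. A contrived example (file names made up):

```crystal
# generator.cr: executed at compile time by the run macro;
# whatever it prints is inserted into the calling file as code.
puts <<-CODE
  struct Point
    getter x : Int32
    getter y : Int32

    def initialize(@x, @y)
    end
  end
  CODE
```

And in the shard's source:

```crystal
# The generated Point struct becomes part of the public API, but it only
# exists if the generator can actually run while `crystal docs` compiles.
{{ run("./generator") }}
```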

Yes, but I don't think that's good practice, exactly for these reasons. The alternative is using a generator script and storing the generated code in the repo. That's much more reliable and makes it easier to track changes.
I seriously doubt the run macro is used much anyway, and even less so for generating public API. If it only generates nodoc code, it doesn't matter for the doc generator.

Agreed that a worker should generate docs behind the scenes and deploy the output to a static file server. I can get some cloud resources to do at least the static file hosting if folks want. Probably the background workers too.

I had been thinking about it over the weekend and wanted to write a proposal down. This is rough and I'm certainly ok with modifying/scrapping it!

CDN

I think Cloudflare is the best choice for a CDN to serve static files. It's not tied to a major cloud provider, has a free tier, and greatly reduces the load on the origin server below.

Origin for the CDN

The CDN requires an origin server on a miss. @RX14, maybe we can keep your colo server? If you don't want to keep running it, I think the next best thing would be to use a cloud blob storage system like S3/Google Cloud Storage/Azure Blob Storage (disclosure: I work in an MS Azure group; I don't think it matters what we use in this case).

Background workers

This one is the most complex, I think. I would split it up into a crawler and a processor that runs crystal docs and sends the static files to the CDN origin.

Crawler

The simplest solution IMO is to run an endless loop somewhere on a managed platform so we don't have to deploy to VMs. There are a few services that I've personally used that work well for this kind of background processing.

We discussed above that it checks the existing shardbox database. It can keep a HashMap of shards it's already seen and send new ones to the processor via an API call.
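
Roughly what I have in mind for that loop (the shardbox query and the processor endpoint are stand-ins, and I'm using a Set instead of the HashMap since we only need membership):

```crystal
require "http/client"
require "json"
require "set"

# Placeholder for whatever query against the shardbox database we end up with.
def fetch_releases : Array(String)
  [] of String # e.g. ["github.com/foo/bar@v1.2.3", ...]
end

# Made-up internal endpoint for the processor described below.
PROCESSOR_URL = "http://processor.internal/build"

seen = Set(String).new

loop do
  fetch_releases.each do |release|
    next if seen.includes?(release)
    seen << release
    HTTP::Client.post(PROCESSOR_URL,
      headers: HTTP::Headers{"Content-Type" => "application/json"},
      body: {release: release}.to_json)
  end
  sleep 5.minutes
end
```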

If we are going to disallow macros, the crawler can just run crystal docs and we can ignore the processor section below. Otherwise, read on.

Processor

The processor is a server that handles API calls from the crawler and dispatches them to isolated runtimes that run crystal docs and send the static docs to the aforementioned CDN origin.

The Go playground executes arbitrary code with the gVisor secure container runtime and I think we can adopt some of their rough architecture: https://talks.golang.org/2019/playground-v3/playground-v3.slide#24.

We might find some ways to simplify this, but the basic idea of an HTTP server firing up a secure runtime sounds about right to me.
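
To make that concrete, here's a minimal sketch of the dispatch side, assuming a build_docs helper that wraps the sandboxed crystal docs run from earlier; the port, route and payload shape are just placeholders:

```crystal
require "http/server"
require "json"

# Stub: clone/checkout, run `crystal docs` in the sandbox and push the
# output to the CDN origin.
def build_docs(release : String)
  # ...
end

server = HTTP::Server.new do |context|
  if context.request.method == "POST" && context.request.path == "/build"
    body = context.request.body.try(&.gets_to_end) || "{}"
    release = JSON.parse(body)["release"]?.try(&.as_s?) || ""

    if release.empty?
      context.response.status_code = 400
    else
      spawn build_docs(release)          # build asynchronously
      context.response.status_code = 202
    end
  else
    context.response.status_code = 404
  end
end

server.bind_tcp 8080
server.listen
```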

Let me know what you think.

I think the crawler already exists as part of shardbox. A CDN (Cloudflare) in front of my server would be how I would deploy it. I don't think there's much need to split up the frontend from the documentation building prematurely. Worry about horizontal scaling when you need to scale. I don't anticipate it in the near future.

First thing I'd work on would be the sandboxing: write a little Crystal script which handles cloning the repo (or pulling), checking out the tag, and building the docs, along with copying them out. Then build an HTTP server to serve these files. Then make it do it on demand. Then make it show a progress bar while it's building. Then integrate it with the crawler.
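
Roughly something like this, just as a sketch with made-up paths and essentially no error handling:

```crystal
require "file_utils"

# Clone or update a repo, check out a tag, build the docs, copy them out.
repo_url, tag, out_dir = ARGV[0], ARGV[1], ARGV[2]
cache_dir = "cache/#{File.basename(repo_url, ".git")}"

if Dir.exists?(cache_dir)
  Process.run("git", ["-C", cache_dir, "fetch", "--tags"], error: STDERR)
else
  Process.run("git", ["clone", repo_url, cache_dir], error: STDERR)
end

Process.run("git", ["-C", cache_dir, "checkout", tag], error: STDERR)
Process.run("crystal", ["docs"], chdir: cache_dir, error: STDERR)

FileUtils.mkdir_p(out_dir)
FileUtils.cp_r("#{cache_dir}/docs", out_dir)
```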

This is just how I'd approach it, off the top of my head.

This is getting into the weeds a little bit though. I think we all agree on approach. The details of the plan never survive contact with the real world.

Ah ok, great to know.

Sounds good.

I'm not understanding: how is this different from the crawler? Would this be an addition to it?

Sounds about right :slight_smile:. Let me know what you're thinking of for next steps.

The crawler finds new repos and releases.

This docs server, when asked, builds docs for a given repo and release.

My personal next steps are largely sorting out my colo server.

I see. I'll start on a docs server that can do crystal docs on demand for now. Should be fairly easy to hook it up to a background docs builder later on.

Those are already some very detailed ideas. Before thinking about deployment and even a CDN, we'll need to get at least a working prototype. There's no need to artificially boost complexity right from the start.

Also, to put scaling requirements into perspective: on average, shardbox currently sees fewer than 10 new releases per day. The highest number of new releases was 30. Last week saw a total of 100, which was most likely caused by updates for the new Crystal release.
Even considering a generous growth factor: a typical crystal docs run should only take a few seconds, so there'd really be plenty of room to grow even with the most simple single-threaded worker loop implementation.
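
As a rough back-of-envelope: even the peak of 30 releases in a day, at a generous ~10 seconds per build, is only about 5 minutes of actual work per day, so a single worker polling in a loop has plenty of headroom.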

I would target optimizations mostly towards performance, to deliver a build result as fast as possible. This will also help with throughput.

The most time is probably spent checking out the git repo. But that shouldn't be too bad when the repo is locally cached and only the delta needs to be pulled. We might also consider keeping the workdir checked out, but I'm not sure whether that would improve things a lot. Maybe for larger repos…
The shardbox worker already pulls the repos to check for new versions, so it would make sense to share the local git cache to speed up checkout time.

The first step from my perspective would be https://github.com/crystal-lang/crystal/issues/6947
The JSON format needs to be revisited and fixed. I don't want to publish a public service when we know the data format it uses is in dire need of refactoring.
Prototypes can work with the current format, so that doesn't block other efforts.

Slightly OT, but just to note: I'm aware, and rebuilding on top of another tool is on my todo list, I just didn't find the motivation yet. Back when I built carc.in it was actively developed and the maintainer was responsive.

One alternative not mentioned so far that I have on my evaluation list is nsjail. A lot of sandboxing/containerization tools overlook the syscall filter part a bit. I will miss playpen's ability to build up a syscall whitelist from a couple of example runs :/

The other day I reworked crystal-gobject to use a run macro, so that generates tons and tons of public API through it alone :D Keeping all the generated code in the repository was boring and verbose. This approach also means I don't have to worry so much anymore about the compatibility of the generated bindings with the actually installed library versions.

@straight-shoota it's likely that I over-engineered the docs backend. I guess file it under ideas for the future, if they're ever needed :smile:

I can start with a simple service that checks out code, runs crystal docs, and has a simple k/v store to cache docs. The problem is the sandbox, so I'll leave the thing as a prototype until we figure that out, along with the format in https://github.com/crystal-lang/crystal/issues/6947. I'll take a look at that issue today.
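
For the caching part I'm picturing something as simple as this to start with, with an in-memory hash standing in for whatever k/v store we end up using:

```crystal
# Maps "repo@tag" to the directory containing the generated docs,
# building them only on the first request.
class DocsCache
  @built = {} of String => String

  def fetch(repo : String, tag : String)
    @built["#{repo}@#{tag}"] ||= yield
  end
end

cache = DocsCache.new
path = cache.fetch("github.com/foo/bar", "v1.2.3") do
  # clone/checkout and the `crystal docs` run would happen here
  "docs/github.com/foo/bar/v1.2.3"
end
puts path
```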
