SciOp The Blog

This Post Goes on the Homepage

2025-09-04T06:43:00+00:00

Many have said “what is going on over there” and “why is the last update from march 2025 when the earth is older than that now.”

In addition to the sciop blog existing, now it is embedded on the frontpage of sciop dot net, a routine feature for a website that has most of its core components relatively squared away.

The Pull Request, first initiated by transorsmth some 3 months ago: https://codeberg.org/Safeguarding/sciop/pulls/404

As with all new development on sciop, it is intended to be configurable by other instances aside from sciop dot net, when such things exist, and can accept multiple atom feeds for e.g. different topics, covering different projects, with different authors, and so on.

So without further ado, a screenshot of text of the other two posts that are also on this website:

A Hint Of Future Plugins

Since this is a very simple feature, and we don’t have much in the way of developer docs at the moment, it seems a good enough example to show the pattern of how most things are implemented in sciop.

It consists of

A config model - UpdateFeedsConfig
Some database models - models/atom.py
A background service to update feeds - services/atom.py
A job wrapper that links the config to the scheduler - update_atom_feeds
An API endpoint to generate an HTML partial - /partials/whatsnew, and
Some templates - templates/macros/whats-new.html

Pretty much “website things.”

These components (and maybe a handful more) are destined to become the basic of a plugin system, where most current functionality is moved into plugins, and most new functionality is implemented as plugins. There is still a bit of work do be done to get us there, mostly in refactoring the existing templates to support plugins modifying and adding components to them¹, and a bit of metaprogramming to wrap API endpoints and handle model migrations, but we have already been writing sciop in separable services with this in mind, so once that is done then that shall be the pattern going forward. As we become a “fediverse app,” the last thing we want to be is “a monolithic fediverse app,” and want to start with a very pluggable design rather than trying to back flexibility into the system later.

For embedding within institutions with rich histories of work patterns and homemade infrastructure, this kind of flexibility and ease of writing custom behavior is essential, and the same is true for supporting different p2p communities. One of the first plugins that we plan on writing after the plugin system lands is integrating external tracker software, where rather than relying on other public trackers, a sciop instance may want to provide its own, embed the tracker link in uploads, generate custom URLs for private torrents to track upload stats, and do all the other fun things one might want from a tracker. We also want to experiment with all the non- or sparsely-implemented FEPs floating around, and will be writing our federation layer as a generic fastapi² overlay that can be useful outside sciop.

Example

To configure feeds, stick something like this in your sciop.yaml

services:
  update_feeds:
    enabled: true
    interval: 10
    feeds:
      - name: sciop
        url: https://blog.sciop.net/feed.xml

with a URL pointing to some atom feed. That’s all that someone using the software would need to do!

That corresponds to boilerplate pydantic models:

class UpdateFeed(BaseModel):
    """
    A single feed to use for "whats new" updates
    """

    url: AnyHttpUrl
    """URL of an Atom feed"""
    name: str
    """
    Short, taglike name to use when displaying posts from this feed.
    """


class UpdateFeedsConfig(JobConfig):
    """
    A set of feeds and service config for updating them
    """

    interval: int = 30
    """
    Frequency (in minutes) to check feed for updates.
    
    If the feed has not been updated in that time, no changes are made to local objects.
    """
    feeds: list[UpdateFeed] | None = None
    """
    Potentially multiple feeds to pull updates from.
    If `feeds` is not provided, service will be disabled.
    """
    max_n_posts: int = 10
    """
    Only keep the last n posts from each configured feed.
    """

Internally, that config is used by function wrapper that schedules the service:

@scheduler.interval(
    minutes=get_config().services.update_feeds.interval,
    enabled=bool(
        get_config().services.update_feeds.enabled and 
        get_config().services.update_feeds.feeds
    ),
)
async def update_atom_feeds() -> None:
    """
    Put documentation in the real thing, but this is merely a simulacrum
    """
    await services.update_feeds()

and then the scheduler will launch a task every interval minutes to update the feed.

Simple as pie. the rest are details.

Appendix - Build and Deploy a Static Site with Codeberg/Forgejo using Webhooks

If you are unsure how one might go about making an atom feed -

We write this very blog with jekyll, generate the feed with jekyll-feed, and use a webhook to trigger a rebuild on push on one of our machines. Total time from push to deployment is ~8 seconds or so.

I wasn’t able to find another blogpost describing this exact process, since the forgejo webhook documentation is a little sparse. So if it saves anyone time…

generate a secret with openssl rand -hex 32
configure a webhook like this:

[
  {
    "id": "build",
    "execute-command" : "/path/to/your/build/command.sh",
    "command-working-directory": "/wherever/your/blog/repo/is",
    "response-message": "Deployed!",
    "trigger-rule": {
      "and": [
        {
          "match": {
            "type": "payload-hmac-sha256",
            "secret": "{YOUR_SECRET}",
            "parameter": {
              "source": "header",
              "name": "X-Forgejo-Signature"
            }
          }
        },
        {
          "match": {
            "type": "value",
            "value": "refs/heads/main",
            "parameter": {
              "source": "payload",
              "name": "ref"
            }
          }
        }
      ]
    }
  }
]

for a build script /path/to/your/build/command.sh like this³:

#!/bin/bash

git pull

/home/webhook/.rbenv/shims/bundle install
/home/webhook/.rbenv/shims/bundle exec jekyll build -d /some/web/directory

configure nginx to forward the URL to the webhook service⁴

server {
  listen 443 ssl;
  server_name your.url.here;

  root /some/web/directory;

  location / {
    try_files $uri $uri/ $uri.html =404;
  }

  location /hooks/ {
    proxy_pass http://localhost:9000/hooks/;
  }
  
  # other stuff for ssl and logging and ratelimiting and etc.
}

Configure a webhook for your repository,
- POSTing JSON
- to your hooks url
- with the secret generated above,
- on push events
- with a branch filter for main.

and ka blammo. push to deploy the blog, and sciop will catch up when it runs its next update.

Now you’re talking to online from your personal computer.

One can already override any template by configuring paths.template_override with a directory of template overrides. E.g. to override src/sciop/templates/pages/datasets.html, one would write their own template and put it in {template_override}/pages/datasets.html . ↩
Or, we may jump ship to litestar, since contributing to fastAPI has proven to be infuriating, and the non-bot PR merge rate can attest to that. ↩
After installing rbenv and ruby, if you aren’t using jekyll then obviously do whatever your thing is to build it obvi ↩
Using certbot to configure ssl ↩

The Modern Webseed

2025-08-29T23:00:00+00:00

How It Works
Adding A Webseed
Why This Is Cool
TODO
Appendix on Malicious Use

BitTorrent is great, but what if you, graceful and mighty but in adverse network conditions, cannot run a torrent client?

Sciop has just the feature for you! As of today¹ you can now add webseeds straight from the website! This means that we can bridge datasets archived in any number of traditional archives, efficiently use existing resources, and bring a new category of institutionalized peers into the swarm.

This work builds on one of our attempts at revitalizing the broader bittorrent tooling ecosystem, where after integrating torrent-models² we can now safely³ and transparently make serverside edits to uploaded torrents.⁴

Stick around after the ad break to learn how webseeds could enrich your life and make you more complete as a person.

How It Works

Webseeds are very simple and require no special behavior on the part of the HTTP server aside from supporting range requests. Where normally the bittorrent client would request pieces from other peers, the client instead requests ranges from the http server. Webseeds can be used along with normal peer swarms, and multiple webseeds can be used at once.

Webseed URLs are constructed like this for multi-file torrents:

{webseed_url}/{torrent_name}/{path}

and like this for single files:

{webseed_url}/{torrent_name}

where torrent_name is the value of the name field in the torrent’s infodict (not the filename). The name field is the name that’s shown in your torrent client and the folder that is created when you download a torrent.⁵

So for a torrent named tentacoli with a file named track-07.mp3, if an http server was hosting that file at https://example.com/desktop/tentacoli/track-7.mp3 then the webseed url would be https://example.com/desktop/.

Adding A Webseed

If you have hosted some data in a torrent somewhere, or are aware of some alternate source for it, you can add it to the torrent from the web interface⁶

When you are logged in to sciop, on an upload page, you will see a button to add a webseed.

If you click it, you can then type in a url

And once you submit it, it will be added to the validation queue

The server will now attempt to validate that the URL works as a webseed by requesting some subset of the pieces and checking them against the piece hashes in the torrent.⁷

If the webseed validates, if the account that added it has the upload permission scope, then it will be added to the torrent! If the account does not have the upload scope, it will enter the moderation queue and only added after manual review by a moderator.

The account that uploaded the torrent retains control of the torrent, so they are able to remove webseeds if they object to them for some reason, and webseeds from accounts without upload permissions are not displayed publicly until they are approved to avoid them being a vandalism vector.

This follows the general soft security moderation pattern of sciop - rather than gatekeeping at the level of account creation, any account can propose a change to be made, but only known accounts automatically have those changes take effect. This allows us to keep the site trustworthy while allowing everyone to participate.

Why This Is Cool

What’s the big deal, it’s just adding a url to a list in a torrent? Like objects in mirrors, webseeds in blogposts are cooler than they appear.

We are not aware of any other trackers that allow post-hoc addition of validated webseeds in a participatory way by people that aren’t the original uploader, but please let us know if we missed something.

Institutional Pipes

As much as we would love for everyone to experience the joy of a torrent swarm, many would-be collaborators have been unable to contribute because their resources are housed in some setting where bittorrent traffic is blocked, or it would otherwise be impossible for them to run a torrent client. This has been a common refrain from our colleagues at academic institutions or archives.

We are not protocol purists, and don’t use bittorrent for bittorrent’s sake, but instead use it to bring all available resources to bear from cute little raspis seeding from a flash drive to enormous Swedish Seedboxes. What are HTTP servers and S3 buckets but very big peers?

Adding webseeds via the web UI provides a new way for those in constrained circumstances to join the swarm by getting the data⁸ and rehosting it on a traditional HTTP server. We also hope to incentivize joining the swarm by providing a little pro-social gamification - having your URL listed as a webseed on sciop lets others know that you care about the preservation of cultural memory.

Bridging Archives

Archives have a problem: there are more than one of them. This is a problem for preserving threatened data, where different archival groups have been patching together dataset storage on zenodo, google drive, globus, and so on; as well as for more conventional archiving, where researchers have to search or upload their work to multiple archives that are mutually incompatible.

Aggregation and indexing overlays address this problem to some degree by collecting potentially multiple links to the potentially multiple places a dataset might be hosted, but they don’t make use of those multiple hosts, allowing the redundancy of resources to improve availability, resilience, and performance. Even if a dataset is hosted in multiple places, I still have to download it from only one of them, so the fastest and largest archives end up shouldering all the costs and smaller archives don’t have a great “path to relevance” - why would i store my data on the slow, unpopular archive?

Treating a torrent as a verifiable shorthand for a dataset and then allowing anyone to add a webseed means that now it is possible to make use of all the available resources for a given dataset - if i have previously archived something to zenodo, or on my special S3 bucket server⁹, and I see someone uploaded it to sciop, I can add in my copy as an additional source. When multiple webseeds are present, bandwidth can be shared between each of them (and the rest of the swarm), decreasing the burden on each individual host.

This allows sciop to take indexing overlays one step further - rather than just aggregating the metadata from multiple hosts, we can aggregate the hosts themselves.

Division of Labor - Scrapers & Seeds

Sciop as a project is about coordinating people with varying resources, expertise, and commitment towards the same goal - it should be possible for someone to wander in off the street¹⁰ and contribute, whether that means spending 5 minutes improving the docs or 5 months scraping alongside us.

We have observed a natural division of labor emerge, where people often fall into one or a few basic roles based on expertise or interest. One of those divisions has been between scrapers and seeders - where some people love to do the work of scraping and downloading, but don’t have the resources to actually store and upload it; and others don’t enjoy scraping but have plenty of spare storage and bandwith.

The pattern of

low-resource scrapers downloading, hashing, and discarding data
uploading a torrent with a webseed, and
other seeders bootstrapping the swarm off the webseed

has proven to be ridiculously effective.

This provides a way for people without seeding resources to effectively call down a swarm of seeders onto an at-risk dataset, who then automatically¹¹ and collaboratively back it up. This process is much more respectful to the hosting servers than everyone scraping their own copy would be, as in the ideal case the webseed needs to only upload the full dataset twice,¹² and then all future downloads have bandwidth supplemented by the swarm. Scraping is additionally deduplicated by our quest system and scraping tools that turn web archiving into an all-play activity, and will be the subject of a future blog post :).

Adding the ability to add webseeds after the fact extends that by being able to adapt in the case that the dataset moves, as well as opens up new opportunities for labor division, where scouts aware of copies of data being held in other places e.g. by other archival groups can do the curation work of adding those to the swarm.

TODO

There are a number of obvious expansion points we’ll be pursuing

The multiple-use of the name field for display and as part of the url in a webseed is a big pain in the ass. this is the reason that, e.g. when downloading the TIF archive of the NMAAHC, one ends up with a million torrents with the same name.¹³ This also poses a problem when the relevant files are not all stored under one subdirectory, or have urls for files with any other naming structure but that of the torrent. We want to create a plugin for sciop that allows you to use a sciop URL as a webseed-by-proxy that then redirects to the appropriate URL on the webseed server, and in the future we’ll be working on our own client that escapes some of the stagnation of current clients and decouples the name field.
When a webseed is added on the server, it is not added to clients that have already downloaded the torrent. We’ll be writing a general sync function for sciop-cli that resolves this and other torrent mutability problems at the client level, updating existing torrents with new metadata, replacing torrents that have been superceded by a repack, and so on.
Validating files in a torrent against a webseed url is almost the same operation as creating torrents from a url. We want to support that for, e.g., using spare server resources to create torrents when scraping resources are thin, as well as for “importing” datasets from other archival systems. This also opens up interesting possibilities for hybrid http/bittorrent-backed web archives for, say, replaywebpage from webrecorder, but more information on that is TBD
We’ll also be adding stats to account pages to encourage adding webseeds, it’s good to be able to brag about doing good things!

If you’d like to get involved with these or any other sciop projects, you are more than welcome to hop on the issues, or contact me or anyone else in SRC to ask to be added to our contributor chat <3

Appendix on Malicious Use

What about privacy? isn’t making people ping a server a whole attack vector?

Adding arbitrary webseeds is no more of an information leak than being able to add arbitrary peers to a swarm. Anyone who wanted to surveil the swarm could do so far more easily by simply joining it or scraping trackers.

Sciop validates a random subset of pieces from any added webseeds before adding them into torrents, and since the selected pieces are random, it would be relatively hard to fake hosting some tiny subset of the data and yield zero reward. If a webseed were to try and serve copies of the data spiked with malware, torrent clients would reject it since it doesn’t match the piece hash and ban the webseed IP.

However if there is some security or privacy risk that we failed to consider, please submit an issue or contact us privately at contact (at) safeguar (dot) de if raising an issue could compromise sciop’s security.

Actually not today, for a week or so, but it was broken and we were letting shrimp have the first dibs. ↩
“interacting with torrent files but not a total nightmare.” ↩
It’s possible to just directly edit bencoded objects, but we don’t make a habit of inviting hell into our minds by passing around anonymous dicts. ↩
More to come, including embedding a first-party tracker, description metadata in the usual places, and json-ld in unusual places. ↩
although while i am writing this i am realizing we need to also display it on sciop, so by the time you read this that will likely already be done. ↩
And via the api ↩
See the docs for more information on the service and the configuration ↩
In our experience, it is typically the incoming connections that are blocked, rather than outgoing, and it is possible to download a torrent with a client or a tool like aria2 even when seeding is impossible. ↩
S3 is just HTTP!!! All data hosted in buckets is now in play. E.g. if your AWS S3 bucket was smithsonian-open-access, the HTTP url is just https://smithsonian-open-access.s3.amazonaws.com/ and all the paths, or prefixes, whatever they call them, are the same. ↩
or the fediverse, as is more often the case. ↩
e.g. if they are subscribed to the relevant torrent .rss feed ↩
Once for the initial hashing, once for initial seeding to the swarm ↩
~ user experience ~ ↩

Welcome to SciOp The Blog!

2025-08-26T00:23:39+00:00

What up everyone, this is the sciop blog. On this website we’ll post updates about the other website, https://sciop.net

This post isn’t really about anything except the blog existing, so you should probably go to a different one