
corpkit: a tool for investigating text | corpkit

For the computational linguists amongst us, what appears to be quite a sophisticated package.


September 29th recap



The agenda was at ; future agendas will also appear there.

Putting the agenda on github also allows you to fork a copy for your own records. Once you have a copy, you can also hit the pencil icon on the github webpage for your copy, and write your notes directly in there. The ambitious individual could even compile a collective record of our discussions this way...

I spent the better part of the session talking about my own tortured path to this point - how I became a 'dh' person (for whatever 'dh' might mean), some critical failures in my teaching to date that have shaped my perspective on what it means to be a teacher and scholar, and a bit about how this course works (especially the idea of the 'dh primer', both a guide for your peers to the world of DH, but also, an advertisement of your own developing sense of what 'dh' should mean in your work. Dh is as DH does, said Forrest Gump).

I also spoke about the point of the and how this will be both an exercise in thinking through how to communicate DH to your non-dh peers, but also a way for you to think through what is important about DH for your own research, a critical and informed piece of meta-thinking about the field. Since this will also live online (like much that we will do; but remember the class policies about that and please talk to me if you have concerns there), it also functions as a kind of weight in the world around which your online digital identity might coalesce.

I then waxed lyrical about git, github, markdown, and the virtues of separating form from content. I also had a conversation afterwards with one of you, and I'll reproduce the gist of some of it below:

Why Git, Github, and Markdown?

One of you: "WHY is this important? WHY is this part of what is called DH? Although I understand the general idea is that it makes research better, keeping up with our Digitalized world... but I feel lost at the moment."

Me: "No, those are good questions! And right there, you’re asking DH questions - why these tools? What’s wrong with Word? Be critical of the tool, why you're using it, what it does, the assumptions about how the world works that are built right in.

Turn those questions on their head: what does Word do to your research? The thing with word, excel, and the rest of the microsoft & apple product line - even more so now than in the past - is that they are trying to lock you into their ecosystem.

So the whole schtick with learning markdown (which is just a plain text file with one or two things like asterisks on either side of a word to indicate bolding, and so on) is that it is as simple a file as you can create: able to be read by any computer or hand held device - no technological or corporate lock in. The thing with word is that those .docx files are in a proprietary format, and one that conflates *what* you write with *how it looks*. Separate that, and you've got portability, future-proofing, and translation into webpages or epubs or pdfs or whatever.

When we separate content (what you’re thinking) from container (word, the typography, the bells & whistles) by putting stuff into plain text files, you can start to do some awesome stuff. For instance:

That is a website that displays a slide presentation. The html that makes slides is separated from the markdown plain text file that actually has my content, my ideas. That second file can be turned into an article, a pdf, a set of handouts, a word document (yes), with a single command to the computer.

So - if you keep stuff as plainly as possible - plain text files with the .md extension, or lists etc as .csv (comma separated values) rather than .xls spreadsheets - your research will always be future proof, able to be deposited in repositories or archives, and accessible!

It allows other people to build on your research more easily - SSHRC for instance is starting to mandate that not just articles but research notes too get made open access.

What’s also fun is since Github understands .md and that # means a level-1 heading, you can get Github to display your files as if they were a website etc."
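
To make the 'separate content from container' point a bit more concrete, here is a minimal sketch in Python. It is a toy, not how Github or any real converter actually works: the function name `tiny_markdown_to_html` and the sample notes are made up for illustration, and it only understands `#` headings, `**bold**`, and `*emphasis*`. The point is simply that the plain text holds the ideas, and a small program decides how they look.

```python
import html
import re


def tiny_markdown_to_html(text):
    """Convert a very small subset of markdown to HTML.

    Hypothetical helper for illustration only: handles '#' headings,
    **bold** and *emphasis*, with blocks separated by blank lines.
    """
    out = []
    for block in text.strip().split("\n\n"):
        # Escape &, < and > so the content cannot break the HTML container.
        block = html.escape(block.strip(), quote=False)
        # Double asterisks first, so single-asterisk emphasis does not eat them.
        block = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", block)
        block = re.sub(r"\*(.+?)\*", r"<em>\1</em>", block)
        # One to six '#' characters followed by a space mark a heading.
        heading = re.fullmatch(r"(#{1,6}) (.*)", block)
        if heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            out.append(f"<p>{block}</p>")
    return "\n".join(out)


# Made-up sample content: this is the part that is 'yours'.
notes = """# My research notes

Plain text keeps *what* I write apart from **how it looks**."""

print(tiny_markdown_to_html(notes))
```

The same `notes` string could just as easily be fed to a different function that produces slides or a handout; nothing about the content would have to change. For real work you would use a full converter rather than a toy like this (pandoc is one such tool).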


Meanwhile, on Slack

I also posted some stuff in our Slack channel that I think is worthwhile repeating here:

"There are lots of markdown cheat sheets out there; I also like to play with when I’m writing in markdown because it renders everything on the other side of the screen so I can see if something worked or not.

Also: you can use to write directly into a github repository. You create a new repository in github, (remembering to tick off the box ‘initialize with a file’) then at you authorize it to play with github. Then you have an online text editor that is writing and saving directly into your repository.

When we get to talking more about open access publishing, there are platforms for collaborative writing etc that make use of that exact same functionality, like or "


And Finally

and finally, a few blog posts & videos that will help you with git and github more generally:

and a somewhat more involved piece, but it does include a video going over the same things we did:


See you next week! Remember, for 'community', you can be interacting with each other on Slack or with the wider DH world (if they find us here) on this site.



What Is “Digital Humanities,” and Why Are They Saying Such Terrible Things about It?

Matthew Kirschenbaum unpacks some of the tortured meanings of the phrase.


Topic models: Past, present, and future by O'Reilly Radar | Free Listening on SoundCloud

If you're into text mining, topic models are extremely important to get one's head around. Easy to misinterpret!


How the Humanities changed the world | OUPblog

"Sadly modern humanists often believe that they are moving towards science when they use an empirical approach in studying texts, art, music, or the past. They are mistaken. Scholars using empirical methods are returning to their roots in the 15th-century studia humanitatis when the empirical approach was invented — and not since disappeared."



Trailblazer – Never get lost.

You know that feeling. You've seen something on a website, somewhere, if only you could remember where it is. Maybe the crucial thing is not *where* you ended up, but how you got there. Trailblazer is a chrome extension that allows you to selectively record your passage across the web, representing it as a kind of force-directed network graph (with hyperlinks!). Now, think what you could do with that, when you build some kind of collaborative web annotation tool into your workflow...

Here's a session I recorded earlier today as I thought about ideas related to the sonification of archaeological data...


New Jack Librarian: Library of Cards

This piece, on the humble history of the index card and various organizational ideas that flow from that (in both analogue and digital form) is well worth a read. What is even better is if you can see the annotations on the page, where the discussion continues. To view this magic, you need a chrome extension called 'Hypothesis'. You can get it here: Then, when that's installed and running, go back to Mita's blog post and click on the icon in your browser bar.

Think what this could do in your teaching, in your research... how do the affordances of 3 x 5 cards, and hypothesis, tie to your own domains?


A Visual Introduction to Machine Learning

Remember this one when we get around to text analysis etc.


Generating Poetry with PoetRNN

Recurrent Neural Networks for generating poetry. Interestingly, one can draw a direct connection all the way back to the I-ching.