10 Database Building
Introduction
Contrary to popular belief, historians DO dabble in numbers, databases, and other math-like things you thought you’d never use again once you passed Math 101. But don’t sweat it. You can use databases to answer some interesting historical questions, and if you take it step-by-step, the process can actually be fairly painless.
What sort of database questions do historians ask? Here are some examples:
- Which American voters were most likely to cast their ballot for Abraham Lincoln in 1860? Did this change by region? Were urban residents more likely to be Democrats in 1860 than rural ones?
- Did the arrival of public universities in Texas in the 1860s, and the increased access to higher education that came with them, lead to greater job opportunities for Texans in the 1870s and 1880s?
- How was the transatlantic slave trade between Africa and the United States affected when Great Britain outlawed the Atlantic slave trade in 1807?
These possibilities are just the proverbial “tip of the iceberg.” Aggregate data, the kind that comes from censuses, election results, and shipping manifests, often provides the best glimpse into ordinary people’s lives. Databases are especially valuable for insight into people who, for one reason or another, did not leave many first-hand accounts. The Slave Voyages database (https://slavevoyages.org/), for example, was compiled by a team of scholars from thousands of manifests and ships’ logs from vessels that crisscrossed the Atlantic between the fifteenth and the nineteenth centuries, and it provides invaluable information on the Middle Passage.
The problem with these primary sources, however, is that they are most helpful when examined in the aggregate rather than read one at a time in their raw form, the way you might read a speech or a set of letters from a famous person. It’s difficult to make sense of a census enumerator’s listing for a single street, or to draw conclusions about changes in the slave trade from a couple of slave ship logs. Put another way (to address the second question above): we can only know whether Texans became more literate during the 1870s by comparing the census information about literacy for all Texans across two censuses (1870 and 1880). Suppose we only counted the number of individuals marked literate on a single street. Even if we found the same residents living on that street in both 1870 and 1880, and learned how many more told the census enumerator they were literate in 1880 than in 1870, we’d only know about literacy for those particular Texas residents, not whether new universities sparked a statewide trend.
Below is a comparison between the 1870 and 1880 U.S. Census section on illiteracy.
Census records for the U.S. can be found at census.gov
The answer to this problem of how to answer historical questions about trends among ordinary people is to create a database that you can feed into powerful analysis programs, such as IBM SPSS Statistics or a geographic information system (GIS). In the paragraphs below, you’ll find an explanation of the basics of creating databases. There are also links at the end of the chapter to a more technical look at database construction.
One challenge is that the informational content of historical sources must be converted, sometimes in several ways, before it is useful as data, and several methodological choices must be made during this process. There are numerous ‘right’ ways to do this, and they will vary with the specific goal and the sources used. The ‘modelling’ of historical data is a difficult process, but it is worth the up-front effort to create a solid, workable design that accurately reflects your data and your project goals.
Sources, Information, and Data
Information may be defined as what the sources provide. Data may be defined as what the database(s) needs. The historian’s challenge is to consider how to transform information into data.
When converting sources into a useful database, you may encounter issues that other database users do not. These problems generally arise from two inevitable realities of historical research:
- At the outset of your research you may not know precisely what kinds of analysis you will want to undertake with your data
- You will likely be unable to anticipate the full extent and scope of the information contained within your source base
Historical research is often unpredictable; unexpected new lines of inquiry frequently emerge along the way. The more you become familiar with available sources, the more likely these developments are to occur. This makes database design difficult. You should be prepared to make adjustments and changes as you go.
The design of your database will have a direct effect on how useful your data will be. Errors at the design phase can make data entry more laborious and difficult; and, more seriously, such errors will have a significant impact upon the database’s ability to retrieve data. It is essential that the initial design of the database be as ‘correct’ as possible to minimize the need for retroactive restructuring down the line. Even then some redesign is inevitable.
The information historians encounter rarely comes neatly packaged. In fact, sometimes it is so deeply buried that only great effort and considerable deduction can reveal it. Often the information is in long narrative form marked with individual quirks. In other cases, it may come in the form of images, sound recordings, or video recordings, each of which presents unique challenges to database design and manipulation. Even more structured sources like tax rolls, deeds, or census records present certain challenges. Uniformity and regularity, the two defining characteristics of databases, are almost non-existent in the historical record.
Understanding the ‘shape’ of the data is essential to understanding how databases in general work and how your specific database functions. Databases of the kind discussed here store data in tables made up of regular, uniform vertical columns and horizontal rows. Your information will need to be entered, and probably modified to some extent, to fit that structure. Some compromise will occur between maintaining the integrity and richness of the original sources and maximizing the analytical power of the database.
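As a minimal sketch of what "regular, uniform columns and rows" looks like in practice, the Python snippet below uses the standard-library sqlite3 module to create one table and add two rows. The table name, the column names, and the two people are invented for illustration and are not taken from any real census page.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every row in this table has the same columns, in the same order.
# The table and column names here are hypothetical.
conn.execute(
    """
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY,
        name       TEXT,
        age        INTEGER,
        occupation TEXT
    )
    """
)

# Placeholder rows, invented for illustration only.
rows = [
    ("Example Person A", 34, "farmer"),
    ("Example Person B", 29, "teacher"),
]
conn.executemany(
    "INSERT INTO person (name, age, occupation) VALUES (?, ?, ?)", rows
)

# Each record comes back with the same shape: (id, name, age, occupation).
people = conn.execute("SELECT * FROM person ORDER BY person_id").fetchall()
for record in people:
    print(record)
```

Anything in the source that does not fit one of these columns (a marginal note, a crossed-out entry) has to be either given a column of its own or left out, which is the compromise described above.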
At first glance, text-based or statistical information may seem the most suitable for conversion into a database. Take, for example, our comparison of illiteracy rates in the US between 1870 and 1880. This source appears ready-made to be converted to “data.” The information is already conveniently arranged into columns and rows, with each column pertaining to one piece of information (name, age, occupation and so on) and each row corresponding to a single individual. This source should “fit” into the database structure without too much conversion; the nature of the information lends itself to database design.
Example of a ‘rectangular’ historical source
Still, even here some additional work is necessary if the source is to be stored in a database. The tabular structure contains the majority of the information, but the page heading includes important details about the place, time, and identity of the census taker that may be useful. These elements do not conform to a neat rectangle. There is also an overabundance of information, most of which is not useful for our research question. It would be helpful to create a new table that has just the information we are looking for.
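One possible way to handle both problems is sketched below, again with Python's standard-library sqlite3 module. The page-heading details are stored once in their own table, each person row points back to its page, and a narrower table keeps only the fields relevant to a literacy question. All table names, column names, and values are hypothetical placeholders, not a transcription of a real census page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Heading details live in one table; the rows of the page live in another.
conn.executescript(
    """
    CREATE TABLE census_page (
        page_id     INTEGER PRIMARY KEY,
        place       TEXT,
        census_year INTEGER,
        enumerator  TEXT
    );
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY,
        page_id    INTEGER REFERENCES census_page(page_id),
        name       TEXT,
        age        INTEGER,
        occupation TEXT,
        can_read   INTEGER,  -- 1 = yes, 0 = no
        can_write  INTEGER   -- 1 = yes, 0 = no
    );
    """
)

# Invented placeholder values, for illustration only.
conn.execute(
    "INSERT INTO census_page VALUES (1, 'Example County', 1870, 'Example Enumerator')"
)
conn.executemany(
    "INSERT INTO person (page_id, name, age, occupation, can_read, can_write) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Example Person A", 34, "farmer", 1, 0),
        (1, "Example Person B", 29, "teacher", 1, 1),
    ],
)

# A new, narrower table holding only what the research question needs.
conn.execute(
    """
    CREATE TABLE literacy AS
    SELECT p.person_id, c.place, c.census_year, p.can_read, p.can_write
    FROM person AS p
    JOIN census_page AS c ON c.page_id = p.page_id
    """
)

literacy_rows = conn.execute(
    "SELECT * FROM literacy ORDER BY person_id"
).fetchall()
for row in literacy_rows:
    print(row)
```

The full person table is kept alongside the narrower one, so the information set aside for now remains available if a new line of inquiry needs it later.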
Conclusion
Although designing a database is usually a time-consuming process, it can ultimately be rewarding in terms of the analytical power it offers. If you have more than a few dozen informational sources and you need aggregate data from these, it is worth spending time in the design and development of a robust database. Remember that no two databases are alike due to differences in sources and needs. The real indicator of success is whether or not it serves your purposes.
Watch for the following as indicators that some design change may be needed.
- Information that you would like to analyze appears repeatedly, but you have nowhere specific to put it (that is, you will need to add new fields).
- You find yourself repeating information from record to record (you will need to rethink the relationships between your tables to prevent this).
- Your datatypes prove unhelpful and need to be changed.
- Some of your data could be standardized or classified.
- You come across information that you had not anticipated when designing the database.
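Two of these indicators, unhelpful datatypes and data that could be standardized, can be illustrated with a short Python sketch. The lookup table of occupation spellings, the function names, and the sample values are all assumptions made for the example; a real project would build its own list from its own sources.

```python
from typing import Optional

# Hypothetical lookup: variant spellings mapped to one standard term.
STANDARD_OCCUPATIONS = {
    "farmer": "farmer",
    "farmr": "farmer",
    "school teacher": "teacher",
    "teacher": "teacher",
}


def standardize_occupation(raw: str) -> str:
    """Return the standard term, or 'unclassified' if the spelling is unknown."""
    key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    return STANDARD_OCCUPATIONS.get(key, "unclassified")


def parse_age(raw: str) -> Optional[int]:
    """Turn an age written as whole years into a number; otherwise return None."""
    text = raw.strip()
    return int(text) if text.isdecimal() else None


def clean_record(raw_occupation: str, raw_age: str) -> dict:
    """Keep the original transcription next to the standardized value."""
    return {
        "occupation_original": raw_occupation,
        "occupation_standard": standardize_occupation(raw_occupation),
        "age_original": raw_age,
        "age_years": parse_age(raw_age),
    }


cleaned = clean_record("  Farmr ", "34")
unknown = clean_record("cooper", "6/12")
print(cleaned)
print(unknown)
```

Storing the age as a number allows averages and comparisons, while keeping the original text next to it preserves what the source actually said, including entries such as "6/12" that do not fit the numeric column.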
Once you have entered some data (it does not need to be all of your sources), develop and try some queries. Working with a small amount of information allows you to know the “correct” answer independently of what the query returns. That permits you to test your query and data design. If you get answers that differ from what you expect, it is easier to troubleshoot the issue while the database is relatively small in size and scope. The differences may stem from the database design or from some logical flaw in the query itself. It is important to isolate and solve these issues while the data is small enough to be manageable.
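A minimal sketch of this kind of check, using Python's standard-library sqlite3 module, might look like the following. The six sample rows are invented placeholders, small enough to count by hand; the table and column names are likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,
        census_year INTEGER,
        can_read    INTEGER  -- 1 = yes, 0 = no
    )
    """
)

# A tiny, invented sample: (census_year, can_read).
sample = [
    (1870, 1), (1870, 0), (1870, 0),
    (1880, 1), (1880, 1), (1880, 0),
]
conn.executemany(
    "INSERT INTO person (census_year, can_read) VALUES (?, ?)", sample
)

# The "correct" answer, counted by hand from the sample above.
expected = {1870: 1, 1880: 2}

# The query under test: number of readers in each census year.
query = """
    SELECT census_year, SUM(can_read)
    FROM person
    GROUP BY census_year
    ORDER BY census_year
"""
result = dict(conn.execute(query).fetchall())

if result == expected:
    print("Query matches the hand count.")
else:
    print("Mismatch: check the query logic or the table design.", result)
```

If the two disagree, the sample is small enough to inspect row by row and decide whether the fault lies in the query or in how the data was entered.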
If you do make changes to the design of your database, consider making copies of the database prior to these changes. Having a backup can be life-saving in case something does not work out as expected.
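For a database stored in SQLite, one way to make such a copy is sketched below with Python's standard-library sqlite3 module. The file names, the table, and the added column are hypothetical, and the example works in a temporary folder so it does not touch any real files; other database programs have their own backup tools.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Work in a throwaway folder; the file names are hypothetical.
workdir = Path(tempfile.mkdtemp())
db_path = workdir / "research.db"
backup_path = workdir / "research_backup.db"

# A small working database with one placeholder row.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person (name) VALUES ('Example Person A')")
conn.commit()

# Step 1: copy the whole database to a backup file BEFORE changing the design.
with closing(sqlite3.connect(backup_path)) as backup_conn:
    conn.backup(backup_conn)

# Step 2: make the design change on the working copy only.
conn.execute("ALTER TABLE person ADD COLUMN occupation TEXT")
conn.commit()
conn.close()


def column_names(path: Path) -> list:
    """List the column names of the person table in the given database file."""
    with closing(sqlite3.connect(path)) as c:
        return [row[1] for row in c.execute("PRAGMA table_info(person)")]


working_columns = column_names(db_path)
backup_columns = column_names(backup_path)
print("working copy:", working_columns)
print("backup copy: ", backup_columns)
```

The backup keeps the old design, so if the change does not work out, the earlier version of the database is still available.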
Database next steps
Below are links to additional information about building and using your database:
- Creating a Database, and the different parts of one.
- Database Rules and Datatypes, tips for a good and somewhat bug-free database.
- Database Troubleshooting and Coding
- Working with Multiple Tables