Skip to Content

Import CSV files into Drupal CCK nodes

Posted in

UPDATE : Esteban Acosta Villafañe from Argentina sent me an updated script that handles linking/importing files with your CSV data, and will dynamically determine the mimetype of those files. You can see his modified script here in the blog posting I made when he sent this to me.

UPDATE 2: It looks as though the Node Import module has come a long ways since I last looked at it. It seems that it can now map fields from a CSV file to fields in a CCK node. This may be MUCH easier than using my script below. But my script should work just fine if Node Import doesn't do it for your needs.

One of the projects I'm working on is to migrate a website from an old/outdated content management system I wrote 5+ years ago to something more modern - Drupal One of the challenges here is the listing type pages that provide rows of links to related websites. These links contain the link title, the URL (of course), and a brief description. The alternatives for moving this data was to either copy the rendered table into a single page node, or make use of the Content Construction Kit (CCK) to create custom nodes reflecting the data. Using CCK allows each list entry to be edited directly and actually follows the intended user's way of thinking much better - they DO NOT know HTML at all. So, I opted for the CCK method. There were thousands of links, so spending some time figuring out how to automate the migration was worth while.

Let me say this up front. If you have a similar situation, you WILL encounter hours of research, coding, and frustrations. If you are like me and just need a "one-off" or throw-away type of solution, you might not feel it's worth the effort once you start seeing the details. Continue reading here, and hopefully I can save you some of those hours.

(DISCLAIMER: I'm not claiming to be an expert in Drupal matters, especially not at this level. If you see any glaring errors in my approaches, or know of a better way, please drop me a line.)

The challenge is to step out of the "Drupal mindset" and simply deal with the database tables directly. Almost all the documentation you'll find online for this kind of migration fails at stepping outside the Drupal mindset. They almost all start talking about the Drupal API, and hooks, modules, etc. That's perfectly fine if you are trying to create a reusable solution that takes advantage of the Drupal framework. But when you just want to dump some data into the tables, there's precious little out there (that I've been able to find at least). I've spent the past 3 days working this out, so I'm hoping this post will save someone those three days.

Knowledge you'll need

  • How to work with MySQL (or your Drupal Database system, if you opted for an alternative)
    • making database backups
    • restoring database backups
    • determining available tables and/or the table structures
    • working at the MySQL command line, or using a tool like phpMyAdmin
  • Some comfort working with PHP code
  • How to work with CSV files

Background Info

Your next stop is to examine two links that are somewhat buried on the Drupal site:

Neither link on their own gives you enough details. But together we can start working out a solution.

The first link shows the table structure for the CCK nodes. With version 5.x some of the table names have changed. Instead of "node_content_X", you'll now see "content_type_X" (where X is your custom node type), and instead of "node_data_field_X" you'll see "content_field_X". You'll be working with these tables. The rest of the page dives into working with Drupal API, which we don't really need.

The second link isn't really a How-To in the proper sense of the idea. It's more of a "here's some info to get you started" type of posting. But from here we can start to glean the process we need. The routines they discuss may work for you directly. But even there they are talking about trial and error and inspecting the database after each change to see what you need to code.

I've been through that trial and error already, so here's the routine I've worked out. First and foremost is to backup your database of course. Be prepared to restore it a few times as you work through the routine.

The Routine

  1. Get the User ID (uid) for the Drupal user who will be considered the author of the new records.
  2. Open the CSV file and start looping over each record.
  3. Get the next Node ID (nid) and Version ID (vid) to be used. (sequences table) Yep, Drupal 5.x isn't using AUTO INCREMENT fields. (apparently version 6.x will)
  4. Create a Node record (node table).
  5. Create a Node Revision record (node_revisions table).
  6. Create the CCK specific record. More notes on this below.
  7. Link the node to a Taxonomy term if needed. (term_node table)
  8. Create a Node Comment Statistic record. (node_comment_statistics table)
  9. Update the Node ID and Version ID sequence numbers.
  10. Do the next record.

Point Discussion
1. If you are really after a throw away routine, use the SQL command "select uid from users where name = 'bob';" to get your user id, then hard code that into your script. In my case I had multiple node types to do, so opted to pass the username as an argument to my script and had it find the uid.

2. The nid and vid values are THE link between your node and anything related to it. The last ID used is stored in the sequences table - so we'll add one to these IDs to get the next ID for our node. In particular we are after the "id" field value where the "name" field has "node_nid" and "node_revisions_vid" values. So we'll have to query the sequences table twice, or get fancy with some SQL or code to get the two distinct records.

3. Adding a record in the node table tells Drupal we have something there. The details of that something are stored elsewhere... (we're getting to it)

4. We also need a node revision record. Drupal does a join between the node and node_revisions tables, and if the join fails we will get a "Page Not Found" message when we try to view or edit the node. (trust me, it took me hours to find that my revision records weren't getting created...). One note here - with CCK, you may have opted to NOT use the BODY field. In this case the "body" and "teaser" fields may not be applicable. However these fields do NOT allow null values. But they WILL allow an empty string. The same applies to the "log" field.

5. Create the CCK record. This is either straight forward, or it's not. But that depends on your CCK node definition. If your CCK node has fields that are NOT reused and do NOT contain multiple values, then you can plop your data directly into the corresponding field in the "content_type_YOUR_NODE" table. If your fields DO contain multiple values, or have been reused in other CCK nodes, then your corresponding field data will need to be stored in the "content_field_YOUR_FIELD" table. In my case, I had simple data sets and had purposely created the nodes to not reuse fields. So I didn't have to worry about the "content_field_X" tables. Regardless, you will need to examine these tables and map the field names to the corresponding CSV column.

6. If you have a required taxonomy field for your CCK node, you'll probably want to assign a suitable "term ID" to the node. This is done in the term_node table, and it contains a tid and nid field. The quickest way to find the correct term ID is to use the GUI to edit the term, and check the term ID that shows up in the URL. Or just grab it from the term_data table - unless you have hierarchies with similar terms...

7. Comment statistics are kinda optional. But as I understand things you are better off to put a record in here than not doing so. If the comment module is enabled, the record is needed. This is like the others - fill in the fields with the pertinent data.

8. Finally, make sure the sequences table is updated with the IDs we just used.

Phew. That's a lot of work. Luckily it's very repetitive stuff, which computers excel at. I created a script to do this for me, and I'm sharing it here in the hopes that someone can make use of it. (and to maintain a copy so I don't need to recreate it again the next time I have to do this...:)

The Script
The script is a PHP command line script. It *should* run with little changes under Linux and Windows, and PHP 5+. I've made use of the PEAR MDB2 database abstraction layer, so you'll need that as well.

The script is a quick and dirty solution. So it doesn't have the usual "best practices" you'd usually see in a more formal application. I've documented it, but the above procedure is what it is doing. So if you've read that, the code should be straight forward.

The script requires a little bit of preparation.

First, backup your database!

Make sure the file is executable. On Linux you can do "chmod 755 import_csv". On Windows, you'll likely need to remove the first line and add ".bat" onto the file name.

Next, an include file needs to be created. This include file is expected to be named "nodetype.inc" - where you replace the "nodetype" bit with your specific node type. The file contains two arrays. The $fields array contains the field names to use for each column of the CSV file. These field names come from the content_type_YOURNODE table. So if my CSV file contained columns in the order of "B", "C", "A", then my $fields array would reflect this order.

The second array $fieldmap is used to map the pertinent field to the "title", "body", and "teaser" fields in the node_revisions table. What is expected here is an ordinal value of the field to use for these fields (zero based). If the body or teaser are to be ignored, place a -1 in these array elements.

The include file was used so that I could prep the three node types I needed individually and wouldn't need to rewrite the main script for each node type. (other than the taxonomy term)

The final prepatory step is the taxonomy term we'll use for the new records. Once you know the term ID to use (see longer description above), place this in the "mapToTerm()" function call. Currently on line 87 of the script. If you don't want a taxonomy term assigned, comment out the "mapToTerm()" function call.

Now with the include file, and the taxonomy term set, we can call our script. It takes the following form:

./import_csv type=NODETYPE csv=FILE db=DBURL user=USERNAME
  • NODETYPE - this is the node type to be imported. It should match the text after the "content_type_" part of the "content_type_YOURNODE" table name. This NODETYPE is also the include file that is searched for. For instance if your node type was "product_info" there will need to be a corresponding "product_info.inc" file in the script directory.
  • FILE - the CSV file to be used
  • DBURL - the database URL used to talk to your Drupal database. You can find this in the /sites/default/settings.php file in your Drupal directory. It should look something like "mysql://username:password@localhost/databaseName".
  • USERNAME - the Drupal username to assign as the author of the new nodes

If all goes well, this will wiz through your CSV file and create new nodes for you. The script uses the first column of the CSV file in it's progress reporting, to help you track down any possible issues with the data.

Conclusion
Some potential bad news. Drupal version 6.x is nearing it's release date. This script may or may not be pertinent at that point - it might not work, or might need to be completely revised to stay current.

All told, three days to create the script and get it working properly. Or I could have spent a week entering data manually. I'd rather be writing code than doing data entry...