| 1 | -# stackexchange-xml-to-csv |
| 1 | +# stackexchange-xml-to-csv |
| 2 | + |
| 3 | +**stackexchange-xml-to-csv** is a CLI tool that converts [Stack Exchange data dumps](https://archive.org/download/stackexchange) from `XML` to `CSV`, a format that is more convenient for importing into various databases. |
| 4 | + |
| 5 | +Table of contents. |
| 6 | +================= |
| 7 | +* [Getting started](#get_start) |
| 8 | + * [Download database dump](#download-dump) |
| 9 | + * [Extract archive(s)](#extract) |
| 10 | + * [Building stackexchange-xml-to-csv](#stackexchange-xml-to-csv-build) |
| 11 | + * [XML to CSV conversion](#xml-to-csv) |
| 12 | +* [RDBMS schema examples](#examples) |
| 13 | + * [PostgreSQL](#pg) |
| 14 | + * [MySQL](#mysql) |
| 15 | +* [License](#license) |
| 16 | + |
| 17 | + |
| 18 | +Getting started. |
| 19 | +================ |
| 20 | +Before you begin, make sure you have a working [Go environment](https://golang.org/doc/install) with Go version 1.14 or later. Run `go version` in the console; it should print the version of the installed compiler. |
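The version requirement above can also be checked from a script. The following is a minimal sketch, not part of the project: the helper names `version_ge` and `parse_go_version` are hypothetical, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extract "1.21.3" from a line such as "go version go1.21.3 linux/amd64".
parse_go_version() {
  sed -n 's/^go version go\([0-9][0-9.]*\).*/\1/p'
}

if command -v go >/dev/null 2>&1; then
  installed="$(go version | parse_go_version)"
  if version_ge "$installed" 1.14; then
    echo "Go $installed is recent enough"
  else
    echo "Go $installed is older than 1.14" >&2
  fi
else
  echo "go not found in PATH" >&2
fi
```

Development builds of Go may report a version string that this pattern does not parse; in that case the check reports the version as too old.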
| 21 | + |
| 22 | + |
| 23 | +## 1. Download database dump. |
| 25 | + |
| 26 | +Choose and download the [database dump](https://archive.org/download/stackexchange) that you are going to convert. |
| 27 | + |
| 28 | +**Important: the Stack Overflow dump is split across 8 separate 7z archives:** |
| 29 | + |
| 30 | +* [stackoverflow.com-Badges.7z](https://archive.org/download/stackexchange/stackoverflow.com-Badges.7z) |
| 31 | +* [stackoverflow.com-Comments.7z](https://archive.org/download/stackexchange/stackoverflow.com-Comments.7z) |
| 32 | +* [stackoverflow.com-PostHistory.7z](https://archive.org/download/stackexchange/stackoverflow.com-PostHistory.7z) |
| 33 | +* [stackoverflow.com-PostLinks.7z](https://archive.org/download/stackexchange/stackoverflow.com-PostLinks.7z) |
| 34 | +* [stackoverflow.com-Posts.7z](https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z) |
| 35 | +* [stackoverflow.com-Tags.7z](https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z) |
| 36 | +* [stackoverflow.com-Users.7z](https://archive.org/download/stackexchange/stackoverflow.com-Users.7z) |
| 37 | +* [stackoverflow.com-Votes.7z](https://archive.org/download/stackexchange/stackoverflow.com-Votes.7z) |
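Because the eight archive names differ only in the table name, their URLs can be generated rather than typed by hand. This is a small sketch built from the links listed above; the function name `stackoverflow_dump_urls` is hypothetical.

```shell
# Print one download URL per Stack Overflow archive listed above.
stackoverflow_dump_urls() {
  base="https://archive.org/download/stackexchange"
  for table in Badges Comments PostHistory PostLinks Posts Tags Users Votes; do
    printf '%s/stackoverflow.com-%s.7z\n' "$base" "$table"
  done
}

stackoverflow_dump_urls
```

The output can be piped to a downloader, for example `stackoverflow_dump_urls | wget -i -`, assuming `wget` is installed. Some of these archives are very large, so check the available disk space first.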
| 38 | + |
| 39 | +## 2. Extract archive(s). |
| 41 | + |
| 42 | +Using [7z](https://www.7-zip.org/) or another archiver, extract the contents of the archive(s) into the directory you will convert from. |
| 43 | + |
| 44 | +Example with the [academia.stackexchange.com.7z](https://archive.org/download/stackexchange/academia.stackexchange.com.7z) dump: |
| 45 | +```shell |
| 46 | +$ mkdir xml csv |
| 47 | +$ 7z e academia.stackexchange.com.7z -oxml |
| 48 | +$ ls xml/ |
| 49 | +Badges.xml Comments.xml PostHistory.xml PostLinks.xml Posts.xml Tags.xml Users.xml Votes.xml |
| 50 | +``` |
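When a dump is split across several archives, as with Stack Overflow, the same `7z e` command can be run in a loop. The sketch below makes several assumptions: `extract_all` and the `SEVENZIP` override variable are hypothetical names, and it presumes the archives contain differently named XML files, since everything lands in one directory.

```shell
# Archiver binary; SEVENZIP is a hypothetical override, defaulting to 7z.
SEVENZIP="${SEVENZIP:-7z}"

# Extract every .7z archive found in directory $1 into directory $2.
extract_all() {
  src_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  for archive in "$src_dir"/*.7z; do
    [ -e "$archive" ] || continue   # skip when the glob matched nothing
    # -y answers "yes" to prompts, e.g. when a file already exists
    "$SEVENZIP" e "$archive" "-o$out_dir" -y
  done
}

# Example: extract_all . xml
```

Do not use this for archives of different sites in the same output directory: each site's dump contains files with the same names (`Posts.xml`, `Users.xml`, and so on), so later archives would overwrite earlier ones.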
| 51 | + |
| 52 | +## 3. Building stackexchange-xml-to-csv. |
| 54 | + |
| 55 | +Clone and build the `stackexchange-xml-to-csv` converter: |
| 56 | + |
| 57 | +```shell |
| 58 | +$ git clone https://github.com/SkobelevIgor/stackexchange-xml-to-csv |
| 59 | +$ cd stackexchange-xml-to-csv/ |
| 60 | +$ go build |
| 61 | +``` |
| 62 | + |
| 63 | +## 4. XML to CSV conversion. |
| 65 | + |
| 66 | +You now have the `stackexchange-xml-to-csv` executable. Use it to convert the XML files: |
| 67 | +```shell |
| 68 | +$ ./stackexchange-xml-to-csv --source-path=../xml --store-to-dir=../csv |
| 69 | +``` |
| 70 | +### List of possible flags: |
| 71 | + |
| 72 | +* `source-path` (**Required**) Absolute or relative path to a directory containing XML files, or to a single XML file. |
| 73 | +* `store-to-dir` (**Optional**) Absolute or relative path to the directory in which to store the resulting CSV files. |
| 74 | +* `skip-html-decoding` (**Optional**) Some of the files (e.g., Posts.xml) contain escaped HTML. By default, the converter decodes it. Use this flag to disable decoding. |
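Since `source-path` also accepts a single XML file, a subset of the dump can be converted one file at a time. The following is a sketch only: `convert_files` and the `CONVERTER` override variable are hypothetical names, and creating the output directory beforehand is a precaution, as this README does not say whether the converter creates it.

```shell
# Path to the converter binary; CONVERTER is a hypothetical override variable.
CONVERTER="${CONVERTER:-./stackexchange-xml-to-csv}"

# Convert each XML file given after the output directory, one run per file.
convert_files() {
  out_dir="$1"
  shift
  mkdir -p "$out_dir"
  for xml_file in "$@"; do
    "$CONVERTER" --source-path="$xml_file" --store-to-dir="$out_dir"
  done
}

# Example: convert_files ../csv ../xml/Posts.xml ../xml/Users.xml
```

To keep escaped HTML untouched, `--skip-html-decoding` could be appended to the converter call, assuming it is a boolean switch that takes no value.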
| 75 | + |
| 76 | + |
| 77 | +Schema examples. |
| 78 | +================ |
| 79 | +Example schema definitions for different databases: |
| 80 | + * [PostgreSQL](example/postgres_ddl.sql) |
| 81 | + |
| 82 | + |
| 83 | +License |
| 84 | +======= |
| 85 | + |
| 86 | +[MIT License](LICENSE) |